arXiv:1501.06711v4 [math.OC] 27 Sep 2016


Decomposable Norm Minimization with Proximal-Gradient Homotopy 
Algorithm 


Reza Eghbali • Maryam Fazel 


Abstract We study the convergence rate of the proximal-gradient homotopy algorithm applied to
norm-regularized linear least squares problems, for a general class of norms. The homotopy algorithm reduces
the regularization parameter in a series of steps, and uses a proximal-gradient algorithm to solve the problem
at each step. The proximal-gradient algorithm has a linear rate of convergence given that the objective function
is strongly convex and the gradient of the smooth component of the objective function is Lipschitz continuous.
In many applications, the objective function in this type of problem is not strongly convex, especially when the
problem is high-dimensional and regularizers are chosen that induce sparsity or low-dimensionality. We show
that if the linear sampling matrix satisfies certain assumptions and the regularizing norm is decomposable,
the proximal-gradient homotopy algorithm converges with a linear rate even though the objective function is
not strongly convex. Our result generalizes results on the linear convergence of the homotopy algorithm for
$\ell_1$-regularized least squares problems. Numerical experiments are presented that support the theoretical
convergence rate analysis.

Keywords Proximal-Gradient • Homotopy • Decomposable norm


1 Introduction 

In signal processing and statistical regression, problems arise in which the goal is to recover a structured
model from a few, often noisy, linear measurements. Well-studied examples include recovery of sparse vectors
and low-rank matrices. These problems can be formulated as non-convex optimization programs, which are
computationally intractable in general. One can relax these non-convex problems using appropriate convex
penalty functions, for example the $\ell_1$, $\ell_{1,2}$ and nuclear norms in sparse vector, group sparse and
low-rank matrix recovery problems. These relaxations perform very well in many practical applications.
Following [10], [6], [8], there has been a flurry of publications that formalize the conditions for recovery of
sparse vectors, e.g., [2], [42], and low-rank matrices from linear measurements by solving the appropriate
relaxed convex optimization problems. Alongside results for sparse vector and low-rank matrix recovery, several
authors have proposed more general frameworks for structured model recovery problems with linear measurements
[5], [9], [27]. In many problems of interest, to recover the model from noisy linear measurements, one can
formulate the following optimization program:


$$\begin{array}{ll} \text{minimize} & \|x\| \\ \text{subject to} & \|Ax - b\|_2^2 \le \epsilon^2, \end{array} \tag{1}$$

where $b \in \mathbb{R}^m$ is the measurement vector, $A \in \mathbb{R}^{m \times n}$ is the linear measurement
matrix, $\epsilon^2$ is the noise energy and $\|\cdot\|$ is a norm on $\mathbb{R}^n$ that promotes the desired
structure in the solution. The regularized version of problem (1) has the following form:


$$\text{minimize} \quad \frac{1}{2}\|Ax - b\|_2^2 + \lambda\|x\|, \tag{2}$$

where $\lambda > 0$ is the regularization parameter.

There has been extensive work on algorithms for solving problems (1) and (2) in the special cases of the
$\ell_1$ and nuclear norms. First order methods have been the method of choice for large scale problems, since
each iteration is computationally cheap. Of particular interest is the proximal-gradient method for minimization
of composite functions, which are functions that can be written as the sum of a differentiable convex function
and a closed convex function. The proximal-gradient method can be utilized for solving the regularized problem
(2).

When the smooth component of the objective function has a Lipschitz continuous gradient, the
proximal-gradient algorithm has a convergence rate of $O(1/t)$, where $t$ is the iteration number. For the
accelerated version of the proximal-gradient algorithm, the convergence rate improves to $O(1/t^2)$. When the
objective function is strongly convex as well, proximal-gradient has linear convergence, i.e., $O(\kappa^t)$
with $\kappa \in (0,1)$ [25]. However, in instances of problem (2) that are of interest, the number of samples
is less than the dimension of the space, hence the matrix $A$ has a nontrivial null space, which results in an
objective function that is not strongly convex. Several algorithms that combine homotopy continuation over
$\lambda$ with proximal-gradient steps have been proposed in the literature for problem (2) in the special cases
of the $\ell_1$ and nuclear norms. Xiao and Zhang [45] have studied an algorithm with homotopy with respect to
$\lambda$ for solving the $\ell_1$-regularized least squares problem. Formulating their algorithm based on
Nesterov's proximal-gradient method, they have demonstrated that this algorithm has an overall linear rate of
convergence whenever $A$ satisfies the restricted isometry property (RIP) and the final value of the
regularization parameter $\lambda$ is greater than a problem-dependent lower bound.


1.1 Our result 

We generalize the linear convergence rate analysis of the homotopy algorithm studied in [45] to problem
(2) when the regularizing norm is decomposable, where decomposability is a condition introduced in [5].
In particular, the $\ell_1$, $\ell_{1,2}$ and nuclear norms satisfy this condition. We derive properties for this
class of norms that are used directly in the convergence analysis. These properties can independently be of
interest. Among these properties is the sublinearity of the function $K : \mathbb{R}^n \to \{0, 1, \ldots, n\}$,
where $K$ is a generalization of the notion of cardinality for decomposable norms.

The linear convergence result holds under an assumption on the RIP constants of A, which in turn holds 
with high probability for several classes of random matrices when the number of measurements m is large 
enough (orderwise the same as that required for recovery of the structured model). 


1.2 Algorithms for structured model recovery 

There has been extensive work on algorithms for solving problems (1) and (2) in the special cases of the
$\ell_1$ and nuclear norms. For a detailed review of first order methods we refer the reader to [30] and
references therein. In [45], the authors have reviewed sparse recovery and $\ell_1$ norm minimization algorithms
that are related to the homotopy algorithm for the $\ell_1$ norm. Here we discuss related algorithms, mostly
focusing on algorithms for other norms, including the nuclear norm.

The proximal-gradient method for $\ell_1$/nuclear norm minimization has a local linear convergence in a
neighborhood of the optimal value. The proximal operator for the nuclear norm is the soft-thresholding operator
on singular values. Several authors have proposed algorithms for low-rank matrix recovery and matrix
completion problems based on soft- or hard-thresholding operators. The singular value projection (SVP)
algorithm proposed by Jain et al. has a linear rate; however, to apply the hard-thresholding operator, one
should know the rank of $x_0$. While the authors have introduced a heuristic for estimating the rank when it is
not known a priori, their convergence results rely upon a known rank. SVP is the generalization of the
iterative hard thresholding algorithm (IHT) for sparse vector recovery. SVP and IHT belong to the family
of greedy algorithms, which do not solve a convex relaxation problem. Other greedy algorithms proposed
for sparse recovery such as Compressive Sampling Matching Pursuit (CoSaMP) ©5] and Fully Corrective 
Forward Greedy Selection (FCFGS) [3S] have also been generalized for recovery of general structured models 
including low-rank matrices and extended to more general loss functions.

For huge-scale problems with a separable regularizing norm such as $\ell_1$ and $\ell_{1,2}$, coordinate
descent methods can reduce the computational cost of each iteration significantly. The convergence rate of the
randomized proximal coordinate descent method in expectation is orderwise the same as full proximal gradient
descent; however, it can yield an improvement in terms of the dependence of the convergence rate on $n$
[28], [35], [20]. To the best of our knowledge, a linear convergence rate for any coordinate descent method
applied to problem (1) or (2) has not been shown in the literature.

Continuation over $\lambda$ for solving the regularized problem has been utilized in the fixed point
continuation algorithm (FPC) proposed by Ma et al. [22] and the accelerated proximal-gradient algorithm with
line search (APGL) by Toh et al. [31]. FPC and APGL both solve a series of regularized problems where in each
outer iteration $\lambda$ is reduced by a factor less than one; the former uses soft-thresholding and the latter
uses accelerated proximal-gradient for solving each regularized problem.

Agarwal et al. [1] have proposed algorithms for solving problems (1) and (2) with an extra constraint
of the form $\|x\| \le \rho$. They have introduced the assumption of decomposability of the norm and given
convergence analysis for norms that satisfy that assumption. They establish a linear rate of convergence for
their algorithms up to a neighborhood of the optimal solutions. However, their algorithm uses the bound
$\rho$, which should be selected based on the norm of the true solution. In many problems this quantity is not
known beforehand. Jin et al. have proposed an algorithm for $\ell_1$-regularized least squares that receives
$\rho$ as a parameter and has a linear rate of convergence. Their algorithm utilizes the proximal gradient
method but, unlike the homotopy algorithm, reduces $\lambda$ at each step.

By using the SDP formulation of the nuclear norm, interior point methods can be utilized to solve problems (1)
and (2). Interior point methods do not scale as well as first order methods for large scale problems (for
example, for a general SDP solver when the dimension exceeds a few hundred). However, specialized SDP
solvers for nuclear norm minimization can bring down the computational complexity of each iteration to
$O(n^3)$ [18].


2 Preliminaries 

Let $A \in \mathbb{R}^{m \times n}$. We equip $\mathbb{R}^n$ with an inner product given by
$\langle x, y \rangle = x^T B y$ for some positive definite matrix $B$. We equip $\mathbb{R}^m$ with the ordinary
dot product $\langle v, u \rangle = v^T u$. We denote the adjoint of $A$ by $A^* = B^{-1} A^T$. Note that for all
$x \in \mathbb{R}^n$ and $u \in \mathbb{R}^m$

$$\langle Ax, u \rangle = \langle x, A^* u \rangle. \tag{3}$$
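The adjoint identity (3) can be checked numerically. The following is a minimal sketch, assuming a randomly generated positive definite matrix `B` and random `A`, `x`, `u`; these values and the helper name `inner_B` are illustrative and not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 6
A = rng.standard_normal((m, n))
G = rng.standard_normal((n, n))
B = G @ G.T + n * np.eye(n)          # assumed example of a positive definite B

def inner_B(x, y):
    """Inner product <x, y> = x^T B y on R^n."""
    return x @ B @ y

# Adjoint with respect to the B-inner product: A* = B^{-1} A^T (an n x m matrix).
A_adj = np.linalg.solve(B, A.T)

x = rng.standard_normal(n)
u = rng.standard_normal(m)
lhs = (A @ x) @ u                    # <Ax, u> with the ordinary dot product on R^m
rhs = inner_B(x, A_adj @ u)          # <x, A* u> with the B-inner product on R^n
print(abs(lhs - rhs))                # difference is at the level of rounding error
```

When $B = I$ the adjoint reduces to the ordinary transpose, which is the setting used for the unweighted $\ell_1$ norm.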

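We use $\|\cdot\|_2$ to denote the norms induced by the inner products in $\mathbb{R}^n$ and $\mathbb{R}^m$,
that is:

$$\forall x \in \mathbb{R}^n : \ \|x\|_2 = \sqrt{x^T B x},$$

$$\forall v \in \mathbb{R}^m : \ \|v\|_2 = \sqrt{v^T v}.$$

We use $\|\cdot\|$ and $\|\cdot\|_*$ to denote a regularizing norm and its dual on $\mathbb{R}^n$. The latter is
defined as:

$$\|y\|_* = \sup \{ \langle y, x \rangle \mid \|x\| \le 1 \}.$$

Given a convex function $f : \mathbb{R}^n \to \mathbb{R}$, $\partial f(x)$ denotes the set of subgradients of $f$
at $x$, i.e., the set of all $z \in \mathbb{R}^n$ such that

$$\forall y \in \mathbb{R}^n : \ f(y) \ge f(x) + \langle z, y - x \rangle.$$

When $f$ is differentiable, $\partial f(x) = \{\nabla f(x)\}$. Note that $\xi \in \partial\|x\|$ if and only if

$$\langle \xi, x \rangle = \|x\|, \tag{4}$$

$$\|\xi\|_* \le 1. \tag{5}$$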
We say $f$ is strongly convex with strong convexity parameter $\mu_f$ when $f(x) - \frac{\mu_f}{2}\|x\|_2^2$ is
convex. For a differentiable function this implies that for all $x, y \in \mathbb{R}^n$:

$$f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu_f}{2}\|x - y\|_2^2. \tag{6}$$

We call the gradient of a differentiable function Lipschitz continuous with Lipschitz constant $L_f$ when,
for all $x, y \in \mathbb{R}^n$,

$$\|\nabla f(x) - \nabla f(y)\|_2 \le L_f \|y - x\|_2. \tag{7}$$

For a convex function /, gradient Lipschitz continuity is equivalent to the following inequality [see m 
Lemma 1.2.3. and Theorem 2.1.5]: 

$$f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{L_f}{2}\|x - y\|_2^2, \tag{8}$$

for all $x, y \in \mathbb{R}^n$.


3 Properties of the regularizing norm and A 

In this section we introduce our assumptions on the regularizing norm $\|\cdot\|$, and derive the properties of
the norm based on these assumptions. The homotopy algorithm of [45] for the $\ell_1$-regularized problem is
designed so that the iterates maintain low cardinality throughout the algorithm; therefore one can use the
restricted eigenvalue property of $A$ when $A$ acts on these iterates. Said another way, the squared loss term
behaves like a strongly convex function over the algorithm iterates, which is why the algorithm can achieve a
fast convergence rate. In the proof, [45] uses the structure of the subdifferential of the $\ell_1$ norm,

$$\partial\|x\|_1 = \{ \mathrm{sgn}(x) + v \mid v_i = 0 \text{ when } x_i \ne 0, \ \|v\|_\infty \le 1 \},$$

as well as the following properties that hold for the cardinality function,

$$\|x\|_1^2 \le \mathrm{card}(x)\,\|x\|_2^2,$$

$$\mathrm{card}(x + y) \le \mathrm{card}(x) + \mathrm{card}(y) \quad \text{(sublinearity)}.$$

We first give our assumption on the structure of the subdifferential of a class of norms (which includes the
$\ell_1$ and nuclear norms but is much more general), and then derive the rest of the properties needed for
generalizing the results of [45].

Before stating our assumptions, we add some more definitions to our toolbox. Let
$S^{n-1} = \{x \in \mathbb{R}^n \mid \|x\|_2 = 1\}$, and let $\mathcal{G}_{\|\cdot\|}$ be the set of extreme points
of the norm ball $\mathcal{B}_{\|\cdot\|} := \{x \mid \|x\| \le 1\}$. We impose two conditions on the regularizing
norm.

Condition 1 For any $x \in \mathcal{G}_{\|\cdot\|}$, $\|x\|_2 = 1$, i.e., all the extreme points of the norm ball
have unit $\|\cdot\|_2$-norm.

The second condition on the norm is the decomposability condition introduced in [5] , which was inspired 
by the assumption introduced in m- 

Condition 2 (Decomposability) For all $x \in \mathbb{R}^n$ there exists a subspace $T_x$ and a vector
$e_x \in T_x$ such that

$$\partial\|x\| = \{ e_x + v \mid v \in T_x^\perp, \ \|v\|_* \le 1 \}. \tag{9}$$

Note that $x \in T_x$ for all $x \in \mathbb{R}^n$, because if $x \notin T_x$, then $x = y + z$ with $y \in T_x$
and $z \in T_x^\perp - \{0\}$. Let $z' = z/\|z\|_*$. Since $e_x + z' \in \partial\|x\|$,
$\|x\| = \langle e_x + z', y + z \rangle = \|x\| + \|z\|_2^2/\|z\|_*$, which is a contradiction.

The decomposability condition has been used in both [5] and [27] to give a simpler and unified proof for
recovery of several structures such as sparse vectors and low-rank matrices.

When attempting to extend this algorithm to general norms, several challenges arise. First, what is the
appropriate generalization of cardinality for other structures and their corresponding norms? Essentially, we
would need to count the number of nonzero coefficients in an appropriate representation and ensure there is
a small number of nonzero coefficients in our iterates, to be able to apply a similar proof idea as in [45].

The next theorem captures one of our main results for any decomposable norm. This theorem provides
a new set of conditions that is based on the geometry of the norm ball and that we show is equivalent to
decomposability on $\mathbb{R}^n$. As a result, one can find a decomposition for any vector in $\mathbb{R}^n$ in
terms of an orthogonal subset of $\mathcal{G}_{\|\cdot\|}$.

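Theorem 1 (Orthogonal representation) Suppose $\mathcal{G}_{\|\cdot\|} \subseteq S^{n-1}$. Then $\|\cdot\|$ is
decomposable if and only if for any $x \in \mathbb{R}^n - \{0\}$ and
$a_1 \in \operatorname{argmax}_{a \in \mathcal{G}_{\|\cdot\|}} \langle a, x \rangle$ there exist
$a_2, \ldots, a_k \in \mathcal{G}_{\|\cdot\|}$ such that $\{a_1, a_2, \ldots, a_k\}$ is an orthogonal set that
satisfies the following conditions:

I. There exists $\{\gamma_i > 0 \mid i = 1, \ldots, k\}$ such that:

$$x = \sum_{i=1}^{k} \gamma_i a_i, \qquad \|x\| = \sum_{i=1}^{k} \gamma_i. \tag{10}$$

II. For any set $\{\eta_i \mid |\eta_i| \le 1, \ i = 1, \ldots, k\}$:

$$\Big\| \sum_{i=1}^{k} \eta_i a_i \Big\|_* \le 1. \tag{11}$$

Moreover, if $\{a_1, a_2, \ldots, a_k\} \subseteq \mathcal{G}_{\|\cdot\|}$ satisfy I and II, then
$e_x = \sum_{i=1}^{k} a_i$.

The proof of Theorem 1 is presented in Appendix B.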
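For the nuclear norm, the orthogonal representation of Theorem 1 is the singular value decomposition. The sketch below, assuming a randomly generated rank-3 matrix and the trace inner product, checks conditions I and II numerically; the variable names (`atoms`, `gammas`, `eta`) are illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 3)) @ rng.standard_normal((3, 4))   # rank 3 (generically)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = int(np.sum(s > 1e-10 * s[0]))                  # numerical rank
atoms = [np.outer(U[:, i], Vt[i]) for i in range(k)]   # a_i = u_i v_i^T
gammas = s[:k]                                     # gamma_i = sigma_i(X)

# Condition I: x = sum_i gamma_i a_i and ||x|| = sum_i gamma_i.
recon = sum(g * a for g, a in zip(gammas, atoms))
nuc = np.linalg.norm(X, 'nuc')

# Orthogonality under the trace inner product <X, Y> = trace(X^T Y).
gram = np.array([[np.sum(a * b) for b in atoms] for a in atoms])

# Condition II: the dual (spectral) norm of sum_i eta_i a_i is at most 1.
eta = rng.uniform(-1.0, 1.0, size=k)
combo = sum(e * a for e, a in zip(eta, atoms))
dual_norm = np.linalg.norm(combo, 2)

print(k, np.allclose(recon, X), np.isclose(gammas.sum(), nuc), dual_norm <= 1 + 1e-9)
```

Here the spectral norm plays the role of $\|\cdot\|_*$, since it is the dual of the nuclear norm under the trace inner product.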


We will see in Section 5 that we need an orthogonal representation for all vectors to be able to bound the
number of nonzero coefficients throughout the algorithm. First, we define a quantity $K(x)$ that bounds the
ratio of the norm $\|\cdot\|$ to the Euclidean norm, and plays the same role in our analysis as cardinality
played in [45]. Then we show that $K(x)$ is a sublinear function, that is, $K(x + y) \le K(x) + K(y)$ for all
$x, y$. This is a key property that is needed in the convergence analysis. Define
$K : \mathbb{R}^n \to \{0, 1, 2, \ldots, n\}$ by

$$K(x) = \|e_x\|_2^2.$$

Note that for every $x \in \mathbb{R}^n$,

$$\|x\|^2 = \langle e_x, x \rangle^2 \le \|e_x\|_2^2 \, \|x\|_2^2 = K(x)\,\|x\|_2^2. \tag{12}$$

Here, the first equality follows from (4), and the inequality follows from the Cauchy-Schwarz inequality.
In the analysis of the homotopy algorithm we utilize (12) alongside the structure of the subgradient given by
(9). The $\ell_1$, $\ell_{1,2}$ and nuclear norms are three important examples that satisfy Conditions 1 and 2.
Here we briefly discuss each one of these norms.

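- Nuclear norm on $\mathbb{R}^{d_1 \times d_2}$ is defined as

  $$\|X\|_* = \sum_{i=1}^{\min\{d_1, d_2\}} \sigma_i(X),$$

  where $\sigma_i(X)$ is the $i$th largest singular value of $X$ given by the singular value decomposition
  $X = \sum_{i=1}^{\min\{d_1, d_2\}} \sigma_i(X)\, u_i v_i^T$. With the trace inner product
  $\langle X, Y \rangle = \mathrm{trace}(X^T Y)$, the nuclear norm satisfies Conditions 1 and 2. In this case,
  $K(X) = \mathrm{rank}(X)$, $\gamma_i = \sigma_i(X)$ and $a_i = u_i v_i^T$ for
  $i \in \{1, 2, \ldots, \mathrm{rank}(X)\}$. The subspace $T_X$ is given by:

  $$T_X = \Big\{ \sum_{i=1}^{\mathrm{rank}(X)} u_i z_i^T + z'_i v_i^T \ \Big|\ z_i \in \mathbb{R}^{d_2}, \ z'_i \in \mathbb{R}^{d_1} \text{ for all } i \Big\},$$

  while $e_X = \sum_{i=1}^{\mathrm{rank}(X)} u_i v_i^T$.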
- Weighted $\ell_1$ norm on $\mathbb{R}^n$ is defined as:

  $$\|x\|_1 = \sum_{i=1}^{n} w_i |x_i|,$$

  where $w$ is a vector of positive weights. With $\langle x, y \rangle = \sum_{i=1}^{n} w_i^2 x_i y_i$, the
  $\ell_1$ norm satisfies Conditions 1 and 2. For the $\ell_1$ norm, $K(x) = |\{i \mid x_i \ne 0\}|$ and
  $\{\gamma_1, \gamma_2, \ldots, \gamma_k\} = \{w_i |x_i| \mid |x_i| > 0, \ i = 1, \ldots, n\}$. $T_x$ is the
  support of $x$, which is defined as:

  $$T_x = \{ y \in \mathbb{R}^n \mid y_i = 0 \text{ if } x_i = 0 \},$$

  while the $i$th element of $e_x$ is $\mathrm{sign}(x_i)/w_i$.

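- $\ell_{1,2}$ norm on $\mathbb{R}^{d_1 \times d_2}$: For a given inner product
  $\langle \cdot, \cdot \rangle : \mathbb{R}^{d_1} \times \mathbb{R}^{d_1} \to \mathbb{R}$ and its induced norm
  $\|\cdot\|_2$ on $\mathbb{R}^{d_1}$, we define:

  $$\|X\|_{1,2} = \sum_{i=1}^{d_2} \|X_i\|_2,$$

  where $X_i$ denotes the $i$th column of $X$. With the inner product
  $\langle X, Y \rangle = \sum_{i=1}^{d_2} \langle X_i, Y_i \rangle$, the $\ell_{1,2}$ norm satisfies Conditions 1
  and 2. For this norm, $K(X) = |\{i \mid X_i \ne 0\}|$ and
  $\{\gamma_1, \gamma_2, \ldots, \gamma_k\} = \{\|X_i\|_2 \mid \|X_i\|_2 > 0, \ i = 1, \ldots, d_2\}$. $T_X$ is the
  column support of $X$, which is defined as:

  $$T_X = \{ [Y_1, Y_2, \ldots, Y_{d_2}] \in \mathbb{R}^{d_1 \times d_2} \mid Y_i = 0 \text{ if } X_i = 0 \},$$

  while the $i$th column of $e_X$ is equal to $0$ if $X_i = 0$ and is equal to $X_i/\|X_i\|_2$ otherwise.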
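The three examples can be made concrete with a short numerical sketch. It assumes unit weights and the standard inner products (so $B = I$), and the helper names `K_l1`, `K_l12` and `K_nuclear` are hypothetical. Each case evaluates $K$ and checks inequality (12), $\|x\|^2 \le K(x)\|x\|_2^2$.

```python
import numpy as np

def K_l1(x, tol=1e-12):
    """Cardinality: number of nonzero entries."""
    return int(np.sum(np.abs(x) > tol))

def K_l12(X, tol=1e-12):
    """Number of nonzero columns."""
    return int(np.sum(np.linalg.norm(X, axis=0) > tol))

def K_nuclear(X, tol=1e-10):
    """Rank of the matrix."""
    return int(np.linalg.matrix_rank(X, tol=tol))

# l1 norm: ||x||_1 = 7, ||x||_2^2 = 25, K = 2.
x = np.array([3.0, 0.0, -4.0, 0.0])
l1_sq, l1_bound = np.sum(np.abs(x)) ** 2, K_l1(x) * (x @ x)

# l_{1,2} norm: sum of column norms, compared with the Frobenius norm.
Xc = np.array([[1.0, 0.0, 2.0],
               [2.0, 0.0, 0.0]])
l12_sq = np.sum(np.linalg.norm(Xc, axis=0)) ** 2
l12_bound = K_l12(Xc) * np.linalg.norm(Xc) ** 2

# Nuclear norm: sum of singular values, compared with the Frobenius norm.
Xn = np.diag([3.0, 1.0, 0.0])
nuc_sq = np.linalg.norm(Xn, 'nuc') ** 2
nuc_bound = K_nuclear(Xn) * np.linalg.norm(Xn) ** 2

print(l1_sq <= l1_bound, l12_sq <= l12_bound, nuc_sq <= nuc_bound)
```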

Our second result on properties of decomposable norms is captured in the next theorem, which establishes
sublinearity of $K$ for decomposable norms.

Theorem 2 For all $x, y \in \mathbb{R}^n$

$$K(x + y) \le K(x) + K(y). \tag{13}$$

Theorem 2 for the $\ell_1$, $\ell_{1,2}$ and nuclear norms is equivalent to sublinearity of the cardinality of
vectors, the number of non-zero columns and the rank of matrices, respectively. The proof of this theorem is
included in Appendix B.
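As a quick illustration of (13) in two of these special cases, the sketch below compares ranks and cardinalities on randomly generated inputs. The sizes, seed and helper `low_rank` are arbitrary assumptions; this is a sanity check, not a proof.

```python
import numpy as np

rng = np.random.default_rng(2)

def low_rank(d1, d2, r):
    """Random d1 x d2 matrix of rank at most r."""
    return rng.standard_normal((d1, r)) @ rng.standard_normal((r, d2))

# Nuclear norm case: K is the rank.
X, Y = low_rank(6, 6, 2), low_rank(6, 6, 3)
rank_x, rank_y = np.linalg.matrix_rank(X), np.linalg.matrix_rank(Y)
rank_sum = np.linalg.matrix_rank(X + Y)

# l1 case: K is the cardinality.
x = np.zeros(10); x[[0, 3]] = [1.0, -2.0]
y = np.zeros(10); y[[3, 7, 8]] = [2.0, 1.0, 4.0]     # x + y cancels at index 3
card = lambda v: int(np.count_nonzero(v))

print(rank_sum, rank_x + rank_y)      # K(X + Y) versus K(X) + K(Y)
print(card(x + y), card(x) + card(y)) # cancellation makes the inequality strict here
```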


3.1 Properties of A 

Restricted Isometry Property was first discussed in JB] for sparse vectors. Generalization of that concept 
to low rank matrices was introduced in [33]. Note that if $K(x) \le k$, then $\|x\| \le \sqrt{k}\,\|x\|_2$.
Based on this observation we define restricted isometry constants of $A \in \mathbb{R}^{m \times n}$ as:

Definition 1 The upper (lower) restricted isometry constant $\rho_+(A, k)$ ($\rho_-(A, k)$) of a matrix
$A \in \mathbb{R}^{m \times n}$ is the smallest (largest) positive constant that satisfies this inequality:

$$\rho_-(A, k)\,\|x\|_2^2 \le \|Ax\|_2^2 \le \rho_+(A, k)\,\|x\|_2^2,$$

whenever $\|x\|^2 \le k\,\|x\|_2^2$.

Proposition 1 Let $A \in \mathbb{R}^{m \times n}$ and $f(x) = \frac{1}{2}\|Ax - b\|_2^2$. Suppose that
$\rho_+(A, k)$ and $\rho_-(A, k)$ are restricted isometry constants corresponding to $A$, then:

$$f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{1}{2}\rho_-(A, k)\,\|x - y\|_2^2, \tag{14}$$

$$f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{1}{2}\rho_+(A, k)\,\|x - y\|_2^2, \tag{15}$$

for all $x, y \in \mathbb{R}^n$ such that $\|x - y\|^2 \le k\,\|x - y\|_2^2$.

Proposition 1 follows from the definition of restricted isometry constants and the following equality:

$$\frac{1}{2}\|A(x - y)\|_2^2 = f(y) - f(x) - \langle \nabla f(x), y - x \rangle.$$
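The equality above, and the way it yields two-sided quadratic bounds on restricted directions, can be checked numerically. The sketch below assumes $B = I$ and a difference $y - x$ supported on a fixed index set `S`. The eigenvalues of $A_S^T A_S$ are used as bounds for that one support only; they illustrate the idea but are not the restricted isometry constants of Definition 1, which are defined over a cone of vectors.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 10, 30                                   # fewer measurements than unknowns
A = rng.standard_normal((m, n)) / np.sqrt(m)
b = rng.standard_normal(m)

f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad = lambda x: A.T @ (A @ x - b)

S = np.array([2, 7, 19])                        # assumed support of the difference
x = rng.standard_normal(n)
d = np.zeros(n)
d[S] = rng.standard_normal(S.size)
y = x + d

bregman = f(y) - f(x) - grad(x) @ (y - x)       # right-hand side of the equality
quad = 0.5 * np.sum((A @ d) ** 2)               # (1/2) ||A (x - y)||_2^2

eigs = np.linalg.eigvalsh(A[:, S].T @ A[:, S])  # ascending eigenvalues on the support
lower = 0.5 * eigs[0] * (d @ d)
upper = 0.5 * eigs[-1] * (d @ d)

print(abs(bregman - quad), lower <= quad <= upper)
```

Although $A^T A$ is singular here ($m$ is smaller than $n$), the lower bound is strictly positive on this support, which is the mechanism behind (14).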
4 Proximal-gradient method and homotopy algorithm


We state the proximal-gradient method and the homotopy algorithm for the following optimization problem:

$$\text{minimize} \quad \phi_\lambda(x) = f(x) + \lambda\|x\|,$$

where $f(x) = \frac{1}{2}\|Ax - b\|_2^2$. While, for simplicity, we analyze the homotopy algorithm for the least
squares loss function, the analysis can be extended to every function of the form $f(x) = g(Ax)$ where $g$ is a
differentiable strongly convex function with Lipschitz continuous gradient. The key element in the
proximal-gradient method is the proximal operator, which was developed by Moreau [25] and later extended to
maximal monotone operators by Rockafellar [36]. Nesterov has proposed several variants of the proximal-gradient
method [29]. In this section, we discuss the gradient method with adaptive line search. For any
$x, y \in \mathbb{R}^n$ and positive $L$, we define:

$$m_{\lambda,L}(y, x) = f(y) + \langle \nabla f(y), x - y \rangle + \frac{L}{2}\|x - y\|_2^2 + \lambda\|x\|,$$

$$\mathrm{Prox}_{\lambda,L}(y) = \operatorname*{argmin}_{x \in \mathbb{R}^n} \ m_{\lambda,L}(y, x),$$

$$\omega_\lambda(x) = \min_{\xi \in \partial\|x\|} \|\lambda\xi + \nabla f(x)\|_*.$$
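For the unweighted $\ell_1$ norm with $B = I$, both $\mathrm{Prox}_{\lambda,L}$ and $\omega_\lambda$ have closed forms: the proximal step is soft-thresholding of a gradient step, and the dual norm is the $\ell_\infty$ norm. The following is a minimal sketch under those assumptions; the function names `soft_threshold`, `prox_l1` and `omega_l1` are illustrative and not from the paper.

```python
import numpy as np

def soft_threshold(v, tau):
    """Entrywise shrinkage: sign(v) * max(|v| - tau, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def prox_l1(A, b, lam, L, y):
    """Prox_{lam,L}(y) for f(x) = 0.5 * ||Ax - b||^2 and the l1 norm."""
    g = A.T @ (A @ y - b)
    return soft_threshold(y - g / L, lam / L)

def omega_l1(A, b, lam, x):
    """omega_lam(x): min over subgradients xi of ||lam * xi + grad f(x)||_inf."""
    g = A.T @ (A @ x - b)
    on_support = np.abs(g + lam * np.sign(x))          # xi_i = sign(x_i) is forced
    off_support = np.maximum(np.abs(g) - lam, 0.0)     # xi_i is free in [-1, 1]
    return float(np.max(np.where(x != 0, on_support, off_support)))

rng = np.random.default_rng(4)
A = rng.standard_normal((8, 20))
b = rng.standard_normal(8)
lam0 = np.max(np.abs(A.T @ b))                         # ||A* b||_* for the l1 norm
zero = np.zeros(20)
print(omega_l1(A, b, lam0, zero), np.allclose(prox_l1(A, b, lam0, 5.0, zero), 0.0))
```

At $\lambda = \|A^* b\|_*$ the origin is optimal, so $\omega_\lambda(0) = 0$ and the proximal step leaves it fixed. This is why the homotopy algorithm can start from $y^{(0)} = 0$ with $\lambda_0 = \|A^* b\|_*$.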


Xiao and Zhang [45] have considered the proximal-gradient homotopy algorithm for the $\ell_1$ norm. Here we
state it for general norms. Algorithm 1 introduces the homotopy algorithm and contains the proximal-gradient
method as a subroutine. The stopping criterion in the proximal-gradient method is based on the quantity

$$\big\| M_t (x^{(t-1)} - x^{(t)}) + \nabla f(x^{(t)}) - \nabla f(x^{(t-1)}) \big\|_*,$$

which is an upper bound on $\omega_\lambda(x^{(t)})$. This follows from the fact that since
$x^{(t)} = \operatorname{argmin}_{x \in \mathbb{R}^n} m_{\lambda, M_t}(x^{(t-1)}, x)$, there exists
$\xi \in \partial\|x^{(t)}\|$ such that $\nabla f(x^{(t-1)}) + \lambda\xi + M_t(x^{(t)} - x^{(t-1)}) = 0$.
Therefore,

$$\begin{aligned}
\omega_\lambda(x^{(t)}) &\le \big\| \lambda\xi + \nabla f(x^{(t)}) \big\|_* \\
&= \big\| \lambda\xi + \nabla f(x^{(t-1)}) + \nabla f(x^{(t)}) - \nabla f(x^{(t-1)}) \big\|_* \\
&= \big\| M_t (x^{(t-1)} - x^{(t)}) + \nabla f(x^{(t)}) - \nabla f(x^{(t-1)}) \big\|_*.
\end{aligned} \tag{16}$$


The homotopy algorithm reduces the value of $\lambda$ in a series of steps and in each step applies the
proximal-gradient method. At step $t$, $\lambda_t = \lambda_0 \eta^t$ and $\epsilon_t = \delta' \lambda_t$ with
$\eta \in (0,1)$ and $\delta' \in (0,1)$. In the proximal-gradient method and the backtracking subroutine, the
parameters $\gamma_{\rm dec} \ge 1$ and $\gamma_{\rm inc} > 1$ should be initialized. Since the function $f$
satisfies the inequality (8), it is clear that $L_{\min}$ should be chosen less than $L_f$.
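The geometric schedule described above can be written out directly. The sketch below assumes the number of outer steps is $N = \lfloor \log(\lambda_{\rm tgt}/\lambda_0) / \log \eta \rfloor$, as in Algorithm 1; the function name and the example values are illustrative.

```python
import math

def homotopy_schedule(lam0, lam_tgt, eta, delta_prime):
    """Return N, the values lambda_t = lam0 * eta**t, and the tolerances delta' * lambda_t."""
    N = math.floor(math.log(lam_tgt / lam0) / math.log(eta))
    lams = [lam0 * eta ** t for t in range(N + 1)]
    eps = [delta_prime * lam for lam in lams]
    return N, lams, eps

N, lams, eps = homotopy_schedule(lam0=1.0, lam_tgt=0.1, eta=0.5, delta_prime=0.2)
print(N, lams, eps)
```

With these example values the last scheduled $\lambda_N$ is still at least $\lambda_{\rm tgt}$, while one more reduction would fall below it. The final stage of the algorithm is therefore run at $\lambda_{\rm tgt}$ itself.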

Theorem 5 in [25] states that the proximal-gradient method has a linear rate of convergence when $f$
satisfies (6) and (8). In Proposition 2 we restate that theorem with minimal assumptions, namely that $f$
satisfies (6) and (8) on a restricted set. The proof of this proposition is given in Appendix B.


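Proposition 2 Let $x^* \in \operatorname{argmin} \phi_\lambda$. If for every $t$:

$$f(x^{(t)}) \ge f(x^*) + \langle \nabla f(x^*), x^{(t)} - x^* \rangle + \frac{\mu_f}{2}\big\|x^{(t)} - x^*\big\|_2^2, \tag{17}$$

$$f(x^{(t+1)}) \ge f(x^{(t)}) + \langle \nabla f(x^{(t)}), x^{(t+1)} - x^{(t)} \rangle + \frac{\mu_f}{2}\big\|x^{(t+1)} - x^{(t)}\big\|_2^2, \tag{18}$$

$$f(x^{(t+1)}) \le f(x^{(t)}) + \langle \nabla f(x^{(t)}), x^{(t+1)} - x^{(t)} \rangle + \frac{L_f}{2}\big\|x^{(t+1)} - x^{(t)}\big\|_2^2, \tag{19}$$

then

$$\phi_\lambda(x^{(t)}) - \phi_\lambda(x^*) \le \Big(1 - \frac{\mu_f}{4\gamma_{\rm inc} L_f}\Big)^t \big(\phi_\lambda(x^{(0)}) - \phi_\lambda(x^*)\big). \tag{20}$$

In addition, if

$$\big\|\nabla f(x^{(t)}) - \nabla f(x^{(t+1)})\big\|_* \le L'_f \big\|x^{(t)} - x^{(t+1)}\big\|_2 \tag{21}$$

and

$$\big\|x^{(t)} - x^{(t+1)}\big\|_* \le \theta \big\|x^{(t)} - x^{(t+1)}\big\|_2 \tag{22}$$

for some constants $\theta$ and $L'_f$, then

$$\begin{aligned}
\omega_\lambda(x^{(t+1)}) &\le \big\| M_{t+1}(x^{(t)} - x^{(t+1)}) + \nabla f(x^{(t+1)}) - \nabla f(x^{(t)}) \big\|_* \\
&\le \theta \Big(1 + \frac{L'_f}{\mu_f}\Big) \sqrt{2\gamma_{\rm inc} L_f \big(\phi_\lambda(x^{(t)}) - \phi_\lambda(x^*)\big)}.
\end{aligned} \tag{23}$$

Algorithm 1 Homotopy
Input: $\lambda_{\rm tgt} > 0$, $\epsilon > 0$
Parameters: $\eta \in (0,1)$, $\delta' \in (0,1)$, $L_{\min} > 0$
  $y^{(0)} \leftarrow 0$, $\lambda_0 \leftarrow \|A^* b\|_*$, $M \leftarrow L_{\min}$, $N \leftarrow \lfloor \log(\lambda_{\rm tgt}/\lambda_0) / \log(\eta) \rfloor$
  for $t = 0, 1, \ldots, N - 1$ do
    $\lambda_{t+1} \leftarrow \eta \lambda_t$
    $\epsilon_t \leftarrow \delta' \lambda_t$
    $[y^{(t+1)}, M] \leftarrow \mathrm{ProxGrad}_{\phi_{\lambda_{t+1}}}(y^{(t)}, M, L_{\min}, \epsilon_t)$
  end for
  $[y, M] \leftarrow \mathrm{ProxGrad}_{\phi_{\lambda_{\rm tgt}}}(y^{(N)}, M, L_{\min}, \epsilon)$

Subroutine 1 $[x, M] = \mathrm{ProxGrad}_{\phi_\lambda}(x^{(0)}, L_0, L_{\min}, \epsilon')$
Parameter: $\gamma_{\rm dec} \ge 1$
  $t \leftarrow 0$
  repeat
    $[x^{(t+1)}, M_{t+1}] \leftarrow \mathrm{Backtrack}_{\phi_\lambda}(x^{(t)}, L_t)$
    $L_{t+1} \leftarrow \max\{L_{\min}, M_{t+1}/\gamma_{\rm dec}\}$
    $t \leftarrow t + 1$
  until $\| M_t(x^{(t-1)} - x^{(t)}) + \nabla f(x^{(t)}) - \nabla f(x^{(t-1)}) \|_* \le \epsilon'$
  $x \leftarrow x^{(t)}$, $M \leftarrow M_t$

Subroutine 2 $[y, M] = \mathrm{Backtrack}_{\phi_\lambda}(x, L)$
Parameter: $\gamma_{\rm inc} > 1$
  while $\phi_\lambda(\mathrm{Prox}_{\lambda,L}(x)) > m_{\lambda,L}(x, \mathrm{Prox}_{\lambda,L}(x))$ do
    $L \leftarrow \gamma_{\rm inc} L$
  end while
  $y \leftarrow \mathrm{Prox}_{\lambda,L}(x)$, $M \leftarrow L$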
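Algorithm 1 and its two subroutines can be sketched in Python for the special case of the unweighted $\ell_1$ norm with $B = I$, where the proximal step is soft-thresholding and the dual norm is the $\ell_\infty$ norm. This is an illustrative sketch under those assumptions, not the authors' implementation. The default parameter values, the `max_iter` safeguard and the small rounding tolerance in the backtracking test are choices made here.

```python
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def f(A, b, x):
    r = A @ x - b
    return 0.5 * (r @ r)

def grad_f(A, b, x):
    return A.T @ (A @ x - b)

def backtrack(A, b, lam, x, L, gamma_inc=2.0):
    """Subroutine 2: increase L until the model m_{lam,L} majorizes phi_lam at the prox point."""
    g, fx = grad_f(A, b, x), f(A, b, x)
    while True:
        y = soft_threshold(x - g / L, lam / L)
        d = y - x
        # phi(y) <= m(x, y) is equivalent to f(y) <= f(x) + <g, d> + (L/2)||d||^2.
        if f(A, b, y) <= fx + g @ d + 0.5 * L * (d @ d) + 1e-12 * max(1.0, abs(fx)):
            return y, L
        L *= gamma_inc

def prox_grad(A, b, lam, x, L0, L_min, eps, gamma_dec=2.0, max_iter=10000):
    """Subroutine 1: proximal-gradient steps until the stopping quantity is at most eps."""
    L, M = L0, L0
    for _ in range(max_iter):
        x_new, M = backtrack(A, b, lam, x, L)
        # Stopping quantity from (16), measured in the l_inf (dual) norm.
        stop = np.max(np.abs(M * (x - x_new) + grad_f(A, b, x_new) - grad_f(A, b, x)))
        x = x_new
        L = max(L_min, M / gamma_dec)
        if stop <= eps:
            break
    return x, M

def homotopy(A, b, lam_tgt, eps, eta=0.7, delta_prime=0.2, L_min=1e-3):
    """Algorithm 1 for the l1 norm: geometric continuation over lambda."""
    lam = np.max(np.abs(A.T @ b))                 # lambda_0 = ||A* b||_*
    y, M = np.zeros(A.shape[1]), L_min
    N = int(np.floor(np.log(lam_tgt / lam) / np.log(eta)))
    for _ in range(N):
        eps_t = delta_prime * lam                 # eps_t = delta' * lambda_t
        lam = eta * lam                           # lambda_{t+1} = eta * lambda_t
        y, M = prox_grad(A, b, lam, y, M, L_min, eps_t)
    return prox_grad(A, b, lam_tgt, y, M, L_min, eps)

# Usage on a small synthetic sparse recovery problem (assumed data).
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 100)) / np.sqrt(40)
x0 = np.zeros(100)
x0[[5, 37, 80]] = [1.0, -2.0, 1.5]
b = A @ x0
y, M = homotopy(A, b, lam_tgt=0.01, eps=1e-6)
print(np.count_nonzero(y), np.linalg.norm(y - x0))
```

Each outer stage is warm-started from the previous solution and from the previous step-size estimate `M`, which is what keeps the intermediate iterates sparse in the analysis that follows.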


5 Convergence result 


First note that since the objective function is not strongly convex, if one applies the sublinear convergence
rate of the proximal gradient method, the iteration complexity of the homotopy algorithm is
$O\big(\frac{1}{\epsilon} + \sum_{t=1}^{N} \frac{1}{\delta' \lambda_t}\big)$, which can be simplified to
$O\big(\frac{1}{\epsilon} + \frac{1}{\delta' \lambda_{\rm tgt}}\big)$. As was stated in the introduction, we use the
structure of this problem to provide a linear rate of convergence when assumptions similar to those needed to
derive recovery bounds hold.

Suppose $b = Ax_0 + z$, for some $x_0 \in \mathbb{R}^n$ and $z \in \mathbb{R}^m$. Here, $z$ is the noise vector
that is added to linear measurements from a structured model $x_0$. Also, we define $k_0 := K(x_0)$ and the
constant $c$:

$$c := \max_{x \in T_{x_0} - \{0\}} \frac{\|x\|^2}{k_0 \|x\|_2^2}.$$

Note that $c = 1$ for the $\ell_1$ and $\ell_{1,2}$ norms, and $c \le 2$ for the nuclear norm. This follows from
the fact that $K(x) \le k_0$ when $x \in T_{x_0}$ for the $\ell_1$ and $\ell_{1,2}$ norms, while
$K(x) \le 2k_0$ when $x \in T_{x_0}$ in the case of the nuclear norm. Throughout this section, we assume the
regularizing norm satisfies Conditions 1 and 2 introduced in Section 3. Before we state the convergence theorem,
we introduce an assumption:
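Assumption 1 $\lambda_{\rm tgt}$ is such that $\|A^* z\|_* \le \lambda_{\rm tgt}/4$. Furthermore, there exist
constants $r > 1$ and $\delta \in (0, \frac{1}{4}]$ such that:

$$\frac{\rho_-\big(A, c k_0 (1 + \gamma)^2\big)}{\rho_+\big(A, 72 r c k_0 (1 + \gamma)\gamma_{\rm inc}\big)} > \frac{c}{r}, \tag{24}$$

$$\rho_-\big(A, 72 r c k_0 (1 + \gamma)\gamma_{\rm inc}\big) > 0, \tag{25}$$

where:

$$\gamma := \frac{\lambda_{\rm tgt}(1 + \delta) + \|A^* z\|_*}{\lambda_{\rm tgt}(1 - \delta) - \|A^* z\|_*}. \tag{26}$$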


We define $\tilde{k} = 36 r c k_0 (1 + \gamma)\gamma_{\rm inc}$. In Appendix A, we provide an upper bound on the
number of measurements needed for (24) to be satisfied with high probability whenever the rows of $A$ are
sub-Gaussian random vectors.

The next theorem establishes the linear convergence of the proximal gradient method when
$\omega_\lambda(x^{(0)}) = \min_{\xi \in \partial\|x^{(0)}\|} \|\nabla f(x^{(0)}) + \lambda\xi\|_*$ is sufficiently
small, while Theorem 4 establishes the overall linear rate of convergence of the homotopy algorithm.

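Theorem 3 Let $x^{(t)}$ denote the $t$th iterate of
$\mathrm{ProxGrad}_{\phi_\lambda}(x^{(0)}, L_0, L_{\min}, \epsilon')$, and let
$x^* \in \operatorname{argmin} \phi_\lambda(x)$. Suppose Assumption 1 holds true for some $r$ and $\delta$,
$L_{\min} \le \gamma_{\rm inc}\rho_+(A, 2\tilde{k})$, and $\lambda \ge \lambda_{\rm tgt}$. If $x^{(0)}$ satisfies:

$$K(x^{(0)}) \le \tilde{k}, \qquad \omega_\lambda(x^{(0)}) \le \delta\lambda,$$

then:

$$K(x^{(t)}) \le \tilde{k}, \tag{27}$$

$$\phi_\lambda(x^{(t)}) - \phi_\lambda(x^*) \le \Big(1 - \frac{1}{4\gamma_{\rm inc}\kappa}\Big)^t \big(\phi_\lambda(x^{(0)}) - \phi_\lambda(x^*)\big), \tag{28}$$

and

$$\omega_\lambda(x^{(t+1)}) \le \Bigg(1 + \frac{\sqrt{\rho_+(A, 1)\,\rho_+(A, 2\tilde{k})}}{\rho_-(A, 2\tilde{k})}\Bigg) \sqrt{2\gamma_{\rm inc}\,\rho_+(A, 2\tilde{k})\big(\phi_\lambda(x^{(t)}) - \phi_\lambda(x^*)\big)}, \tag{29}$$

where $\kappa = \dfrac{\rho_+(A, 2\tilde{k})}{\rho_-(A, 2\tilde{k})}$.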

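Theorem 4 Let $y^{(t)}$ denote the $t$th iterate of the Homotopy algorithm, and let
$y^* \in \operatorname{argmin} \phi_{\lambda_{\rm tgt}}(y)$. Suppose Assumption 1 holds true for some $r$ and
$\delta$, $L_{\min} \le \gamma_{\rm inc}\rho_+(A, 2\tilde{k})$, and $\lambda_0 \ge \lambda_{\rm tgt}$. Furthermore,
suppose that $\delta'$ and $\eta$ in the algorithm satisfy:

$$\frac{1 + \delta'}{1 + \delta} \le \eta. \tag{30}$$

When $t = 0, 1, \ldots, N - 1$, the number of proximal-gradient iterations for computing $y^{(t)}$ is bounded by

$$\frac{\log(C/\delta^2)}{\log\Big(\big(1 - \frac{1}{4\gamma_{\rm inc}\kappa}\big)^{-1}\Big)}. \tag{31}$$

The number of proximal-gradient iterations for computing $y$ is bounded by

$$\frac{\log(C\lambda_{\rm tgt}/\epsilon^2)}{\log\Big(\big(1 - \frac{1}{4\gamma_{\rm inc}\kappa}\big)^{-1}\Big)}, \tag{32}$$

where

$$C := \frac{6\gamma_{\rm inc}\kappa\delta c k_0 (1 + \gamma)\Big(\sqrt{\rho_-(A, 2\tilde{k})} + \sqrt{\rho_+(A, 1)\,\kappa}\Big)^2}{\rho_-\big(A, c(1 + \gamma)^2 k_0\big)}
\quad \text{and} \quad \kappa = \frac{\rho_+(A, 2\tilde{k})}{\rho_-(A, 2\tilde{k})}.$$

The objective gap of the output $y$ is bounded by

$$\phi_{\lambda_{\rm tgt}}(y) - \phi_{\lambda_{\rm tgt}}(y^*) \le \frac{9 c k_0 \lambda_{\rm tgt}(1 + \gamma)\epsilon}{\rho_-\big(A, c(1 + \gamma)^2 k_0\big)},$$

while the total number of iterations for computing $y$ is bounded by:

$$\frac{\log(C\lambda_{\rm tgt}/\epsilon^2) + \big(\log(\lambda_{\rm tgt}/\lambda_0)/\log(\eta)\big)\log(C/\delta^2)}{\log\Big(\big(1 - \frac{1}{4\gamma_{\rm inc}\kappa}\big)^{-1}\Big)}.$$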


5.1 Parameters selection satisfying the assumptions 

Four parameters, $L_{\min}$, $\lambda_{\rm tgt}$, $\delta'$ and $\eta$, should be set in the homotopy algorithm.
The assumption on $L_{\min}$ is only for convenience. If $L_{\min} > \gamma_{\rm inc}\rho_+(A, 2\tilde{k})$, one can
replace $\gamma_{\rm inc}\rho_+(A, 2\tilde{k})$ with $L_{\min}$ in the analysis.

Assumption 1 requires $\lambda_{\rm tgt} \ge 4\|A^* z\|_*$. This assumption on the regularization parameter is a
standard assumption that is used in the literature to provide optimal bounds for recovery error. The lower
bound on $\lambda_{\rm tgt}$ ensures $\gamma \le \frac{5 + 4\delta}{3 - 4\delta}$. If we choose $\delta$ and
$\eta$, we can set $\delta' = (1 + \delta)\eta - 1$ to ensure that it satisfies (30). The parameter $\delta$ is
directly related to satisfiability of (24) in Assumption 1. For example, if $\delta = 1/12$, then
$\gamma \le 2$ and Assumption 1 is satisfied with $r = 2c$ if:

$$\frac{\rho_-(A, 9 c k_0)}{\rho_+(A, 432 c^2 k_0 \gamma_{\rm inc})} > \frac{1}{2},$$

$$\rho_-(A, 432 c^2 k_0 \gamma_{\rm inc}) > 0.$$

Theoretically, the optimal choice of $\delta$ maximizes n subject to existence of $r > 1$ that satisfies (24) and
(25). In Appendix A, we provide an upper bound on the number of measurements needed for (24) and (25)
to be satisfied with high probability for given $\delta$ and $r > 1$ whenever the rows of $A$ are sub-Gaussian
random vectors. The parameter $\eta$ should be chosen to be greater than $\frac{1}{1 + \delta}$ for (30) to be
satisfied.
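The arithmetic behind the example $\delta = 1/12$ can be reproduced with exact fractions. The sketch below assumes the worst case $\|A^* z\|_* = \lambda_{\rm tgt}/4$ in (26); the function names and the sample values of $c$, $k_0$, $\gamma_{\rm inc}$ and $\eta$ are illustrative choices.

```python
from fractions import Fraction

def gamma_bound(delta):
    """Value of gamma in (26) when ||A* z||_* = lambda_tgt / 4."""
    return (Fraction(5) + 4 * delta) / (Fraction(3) - 4 * delta)

def delta_prime(delta, eta):
    """Choice delta' = (1 + delta) * eta - 1, which satisfies (30) with equality."""
    return (1 + delta) * eta - 1

def sparsity_levels(c, k0, gamma, gamma_inc, r):
    """Arguments of rho_- and rho_+ appearing in (24)."""
    lower = c * k0 * (1 + gamma) ** 2
    upper = 72 * r * c * k0 * (1 + gamma) * gamma_inc
    return lower, upper

delta = Fraction(1, 12)
gamma = gamma_bound(delta)                       # equals 2
eta_min = 1 / (1 + delta)                        # eta must exceed 12/13 so that delta' > 0
dp = delta_prime(delta, Fraction(19, 20))        # 7/240 for eta = 0.95
c, k0, gamma_inc = 1, 10, 2
levels = sparsity_levels(c, k0, gamma, gamma_inc, r=2 * c)   # (9 c k0, 432 c^2 k0 gamma_inc)
print(gamma, eta_min, dp, levels)
```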


5.2 Convergence proof 


The main part of the proof of Theorems 3 and 4 is establishing the fact that $K(x^{(t)}) \le \tilde{k}$. Given
that $K(x^{(t)}) \le \tilde{k}$ for all $t$, Proposition 1 ensures that the hypotheses of Proposition 2, i.e.,
strong convexity and gradient Lipschitz continuity over a restricted set, are satisfied. We adopt the same
strategy as in [45] and prove that $K(x^{(t)}) \le \tilde{k}$ in a series of three lemmas. We have written the
statements of the lemmas here, while their proofs are given in Appendix B. Lemma 1 states that if
$\omega_\lambda(x)$ does not exceed a small fraction of $\lambda$, then $x$ is close to $x_0$.


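Lemma 1 If $\omega_\lambda(x) \le \delta\lambda$ and $\rho_-\big(A, c(1 + \gamma)^2 k_0\big) > 0$, then:

$$\max\Big\{ \|x - x_0\|, \ \frac{1}{\delta\lambda}\big(\phi_\lambda(x) - \phi_\lambda(x_0)\big) \Big\} \le \frac{c k_0 (1 + \gamma)\big((1 + \delta)\lambda + \|A^* z\|_*\big)}{\rho_-\big(A, c(1 + \gamma)^2 k_0\big)}.$$

Note that if $\lambda \ge 4\|A^* z\|_*$ and $\delta \le \frac{1}{4}$, we can simplify the conclusion of Lemma 1 as

$$\max\Big\{ \|x - x_0\|, \ \frac{1}{\delta\lambda}\big(\phi_\lambda(x) - \phi_\lambda(x_0)\big) \Big\} \le \frac{3 c k_0 \lambda (1 + \gamma)}{2\rho_-\big(A, c(1 + \gamma)^2 k_0\big)}. \tag{33}$$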
While the hypothesis of this lemma is true in the first step of every outer iteration of the homotopy algorithm,
$\omega_\lambda(x^{(t)})$ may not be decreasing in the proximal-gradient algorithm. However, the objective
decreases after every iteration of the proximal-gradient algorithm. Thus, to conclude that $x^{(t)}$ is close to
$x_0$ in all the inner proximal-gradient steps, we can use the following lemma:

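Lemma 2 Suppose Assumption 1 holds true, and $\lambda \ge \lambda_{\rm tgt}$. If

$$\phi_\lambda(x) - \phi_\lambda(x_0) \le \frac{3 c k_0 \delta \lambda^2 (1 + \gamma)}{2\rho_-\big(A, c(1 + \gamma)^2 k_0\big)},$$

then

$$\max\Big\{ \frac{1}{\lambda}\|A(x - x_0)\|_2^2, \ \|x - x_0\| \Big\} \le \frac{9 c k_0 \lambda (1 + \gamma)}{2\rho_-\big(A, c(1 + \gamma)^2 k_0\big)}.$$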


The proofs of Lemma 1 and Lemma 2 generalize the proofs of the corresponding lemmas in [45], given
for the $\ell_1$ norm, to norms that satisfy Condition 2, using the structure of $\partial\|x_0\|$ given by (9).
The last lemma provides an upper bound on $K(x^+)$, where $x^+$ is produced via a proximal-gradient step on $x$,
as long as $x$ satisfies the conclusion of Lemma 2 and Assumption 1 holds. The proof of Lemma 3 uses a slightly
different approach than the one given in [45], resulting in a simpler requirement on $\tilde{k}$ in Assumption 1.


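Lemma 3 Let $x^+ = \mathrm{Prox}_{\lambda,L}(x)$ and suppose Assumption 1 holds true, and
$\lambda \ge \lambda_{\rm tgt}$. If $L \le \gamma_{\rm inc}\rho_+(A, 2\tilde{k})$ and

$$\max\Big\{ \frac{1}{\lambda}\|A(x - x_0)\|_2^2, \ \|x - x_0\| \Big\} \le \frac{9 c k_0 \lambda (1 + \gamma)}{2\rho_-\big(A, c(1 + \gamma)^2 k_0\big)},$$

then $K(x^+) \le \tilde{k}$.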


5.3 Proof of Theorem 3

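First we show that $L_t \le \gamma_{\rm inc}\rho_+(A, 2\tilde{k})$ and $K(x^{(t)}) \le \tilde{k}$ for all
$t \ge 0$. The inequalities hold true for $t = 0$ by the hypothesis. Suppose
$L_t \le \gamma_{\rm inc}\rho_+(A, 2\tilde{k})$ and $K(x^{(t)}) \le \tilde{k}$ for some $t \ge 0$. Since
$\phi_\lambda(x^{(t)}) \le \phi_\lambda(x^{(0)})$, by Lemma 2 we have:

$$\max\Big\{ \frac{1}{\lambda}\big\|A(x^{(t)} - x_0)\big\|_2^2, \ \big\|x^{(t)} - x_0\big\| \Big\} \le \frac{9 c k_0 \lambda (1 + \gamma)}{2\rho_-\big(A, c(1 + \gamma)^2 k_0\big)}.$$

By Lemma 2, Lemma 3 and Theorem 2, for any $L \le \gamma_{\rm inc}\rho_+(A, 2\tilde{k})$:

$$K\big(\mathrm{Prox}_{\lambda,L}(x^{(t)})\big) \le \tilde{k},$$

$$K\big(\mathrm{Prox}_{\lambda,L}(x^{(t)}) - x^{(t)}\big) \le 2\tilde{k}.$$

Now we can use Proposition 1 to conclude that $M_{t+1} \le \gamma_{\rm inc}\rho_+(A, 2\tilde{k})$, hence
$L_{t+1} = \max\{L_{\min}, M_{t+1}/\gamma_{\rm dec}\} \le \gamma_{\rm inc}\rho_+(A, 2\tilde{k})$. In addition, by
Lemma 3, $K(x^{(t+1)}) = K\big(\mathrm{Prox}_{\lambda, M_{t+1}}(x^{(t)})\big) \le \tilde{k}$.

Since $\mathrm{Prox}_{\lambda,L}(x^*) = x^*$ for any $L > 0$, by Lemmas 1, 2 and 3, $K(x^*) \le \tilde{k}$. By
Theorem 2, we have:

$$K(x^{(t+1)} - x^{(t)}) \le 2\tilde{k}, \qquad K(x^{(t)} - x^*) \le 2\tilde{k},$$

which yields

$$\begin{aligned}
\big\|A^* A (x^{(t+1)} - x^{(t)})\big\|_* &= \max_{a \in \mathcal{G}_{\|\cdot\|}} \big\langle a, A^* A (x^{(t+1)} - x^{(t)}) \big\rangle \\
&= \max_{a \in \mathcal{G}_{\|\cdot\|}} \big\langle Aa, A (x^{(t+1)} - x^{(t)}) \big\rangle
\le \sqrt{\rho_+(A, 1)\,\rho_+(A, 2\tilde{k})}\ \big\|x^{(t+1)} - x^{(t)}\big\|_2.
\end{aligned} \tag{34}$$

Now Proposition 1 and (34) ensure that all the hypotheses of Proposition 2 are satisfied with
$\mu_f = \rho_-(A, 2\tilde{k})$, $L_f = \rho_+(A, 2\tilde{k})$,
$L'_f = \sqrt{\rho_+(A, 1)\,\rho_+(A, 2\tilde{k})}$ and $\theta = 1$. Thus the conclusion follows from
Proposition 2.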


5.4 Proof of Theorem 4

Let yf £ argmin^Aj (y) . For the ease of notation let Ajv+i <— A tg t- First we show that uj\ t+1 ( y W) < SX t +i 
and K (yW) < k for t = 0,1,..., N. When t = 0, we have y^ = 0 and Xq = ||A* 6 ||*. Therefore, K(y (°)) = 0 


and 


^Xi(y (0) ) = ^min | \\A*b + Ai£||* 
Since —^- £ <9||0|| 


< 


Ao 

A*b-^A*b 


= (1 - p) A 0 < <5Ai, 
se u>\ t 

(,>*>) 


where in the last inequality we used (l30l) . Suppose ijJ\ t (y(* < SX t and K (y^ *)) < fc. By Theorem [3] 

we have: 


K 


< k. 


By (1161) . the stopping condition in the proximal gradient algorithm ensures oj\ t (y^) < S' At. Therefore, 
there exists £ £ 9||yW|| such that ||H* ( Ay M — b) + A t £||* < S' A t . Now using hypothesis (fBOl) . we get: 

wa 4+1 (y (t) ) < ||;4* (V t) -6)+At+i^|* 

< A* ^Ay^ — bj + A t £ + ||(A t+ i — A t )^|| 

< wa ( (y (t) ) + (At — At+i) < (—1 + (S' + 1 ) / 77 ) At+i < SXt+i- 
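
The last inequality in the chain above holds exactly when $(1+\delta')/\eta - 1 \le \delta$. The short sketch below, whose function names are hypothetical, checks this condition for given parameters and returns the smallest $\eta$ that satisfies it. It only restates that single inequality and does not encode any other hypothesis of Theorem 4.

```python
def min_eta(delta, delta_prime):
    """Smallest eta for which (1 + delta') / eta - 1 <= delta holds."""
    return (1.0 + delta_prime) / (1.0 + delta)

def warm_start_ok(eta, delta, delta_prime):
    """Check the last step of the chain: -1 + (delta' + 1) / eta <= delta."""
    return (1.0 + delta_prime) / eta - 1.0 <= delta

# Example values chosen only for illustration.
print(min_eta(0.5, 0.2))
print(warm_start_ok(0.9, 0.5, 0.2), warm_start_ok(0.6, 0.5, 0.2))
```

Since $\delta' \ge 0$, the bound returned by `min_eta` stays above $1/(1+\delta)$, so a smaller admissible $\eta$ requires a larger $\delta$.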
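By Lemma 1 and the comment that follows it, for all $t = 0, \dots, N$, we have

$$\|y^{(t)} - y^*_{t+1}\| \le \|y^{(t)} - x_0\| + \|y^*_{t+1} - x_0\| \le \frac{c k_0 (1+\gamma)\left((2+\delta)\lambda_{t+1} + 2\|A^*z\|_*\right)}{\rho_-(A, c(1+\gamma)^2 k_0)} \le \frac{3 c k_0 (1+\gamma)\lambda_{t+1}}{\rho_-(A, c(1+\gamma)^2 k_0)}.$$

Hence, by convexity of $\phi_{\lambda_{t+1}}$ and with $\xi \in \partial\|y^{(t)}\|$ attaining the minimum in the definition of $\omega_{\lambda_{t+1}}(y^{(t)})$,

$$\phi_{\lambda_{t+1}}(y^{(t)}) - \phi_{\lambda_{t+1}}(y^*_{t+1}) \le \left\langle A^*(Ay^{(t)} - b) + \lambda_{t+1}\xi,\ y^{(t)} - y^*_{t+1}\right\rangle \le \omega_{\lambda_{t+1}}(y^{(t)})\,\|y^{(t)} - y^*_{t+1}\| \le \frac{3\delta c k_0 (1+\gamma)\lambda_{t+1}^2}{\rho_-(A, c(1+\gamma)^2 k_0)}.$$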
Fig. 1: Comparison of homotopy, proximal-gradient and accelerated proximal-gradient algorithms for problem 1. Panels: (a) Objective gap vs. iteration, (b) Rank vs. iteration.


Now the upper bounds in (31) and (32) on the number of inner iterations follow from the second conclusion
in Theorem 3.

By (1M1) . we have 


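$$\|y - y^*\| \le \|y - x_0\| + \|x_0 - y^*\| \le \frac{9 c k_0 \lambda_{\mathrm{tgt}} (1+\gamma)}{\rho_-(A, c(1+\gamma)^2 k_0)}.$$

By convexity of $\phi_{\lambda_{\mathrm{tgt}}}$, and with $\xi \in \partial\|y\|$ attaining $\omega_{\lambda_{\mathrm{tgt}}}(y) \le \epsilon$, we get:

$$\phi_{\lambda_{\mathrm{tgt}}}(y) - \phi_{\lambda_{\mathrm{tgt}}}(y^*) \le \left\langle A^*(Ay - b) + \lambda_{\mathrm{tgt}}\xi,\ y - y^*\right\rangle \le \frac{9 c k_0 \lambda_{\mathrm{tgt}} (1+\gamma)\,\epsilon}{\rho_-(A, c(1+\gamma)^2 k_0)}.$$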


6 Numerical Experiments 

We consider two problems. The details of each problem are summarized in the following table: 



                         Problem 1                                                      Problem 2
Objective                $\frac{1}{2}\|A\,\mathrm{vec}(X) - b\|_2^2 + \lambda\|X\|_*$           $\frac{1}{2}\|A\,\mathrm{vec}(X) - b\|_2^2 + \lambda\|X\|_{1,2}$
Dimension of $X_0$       $300 \times 300$                                               $50 \times 1000$
$K(X_0)$                 $\mathrm{rank}(X_0) = 10$                                      number of non-zero columns of $X_0$ = 50
Number of samples        $m = 20000$                                                    $m = 18000$
$b$                      $A\,\mathrm{vec}(X_0) + z$                                     $A\,\mathrm{vec}(X_0) + z$
$A_{i,j}$ sampled from   $\mathcal{N}(0, 1/\sqrt{m})$                                   $\{-1/\sqrt{m}, 1/\sqrt{m}\}$ uniformly at random
$z_i$ sampled from       $\mathcal{U}(-0.005, 0.005)$                                   $\mathcal{U}(-0.005, 0.005)$


In the homotopy algorithm, $\lambda_0 = \|A^T b\|_*$ and $\lambda_{\mathrm{tgt}} = 4\|A^T z\|_*$, while in the proximal-gradient algorithm
$\lambda = \lambda_{\mathrm{tgt}}$. The default values of $\eta$ and $\delta'$ in the homotopy algorithm are $\eta = 0.6$, $\delta' = 0.2$.
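
The parameter choices above can be illustrated with a minimal sketch of a proximal-gradient homotopy loop. It is written for the $\ell_1$ norm rather than the nuclear or $\ell_{1,2}$ norms used in the experiments, uses a fixed step $1/L$ with $L = \|A\|_2^2$ instead of backtracking, and runs on a much smaller hypothetical problem. All function names, sizes and the final tolerance `eps` are assumptions for illustration, not the code used to produce the figures.

```python
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def omega_l1(A, b, x, lam):
    """min over xi in the l1 subdifferential at x of ||A^T(Ax - b) + lam*xi||_inf."""
    g = A.T @ (A @ x - b)
    on_support = x != 0
    r = np.where(on_support, np.abs(g + lam * np.sign(x)),
                 np.maximum(np.abs(g) - lam, 0.0))
    return float(np.max(r))

def pg_homotopy_l1(A, b, lam_tgt, eta=0.6, delta_p=0.2, eps=1e-6, max_inner=5000):
    """Homotopy over lam with proximal-gradient inner loops (fixed step, no backtracking)."""
    L = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of the smooth gradient
    lam = np.max(np.abs(A.T @ b))            # lam_0 = ||A^T b||_inf (dual norm of l1)
    x = np.zeros(A.shape[1])
    total = 0
    while True:
        lam = max(eta * lam, lam_tgt)        # shrink lam by eta, never below lam_tgt
        final = lam == lam_tgt
        tol = eps if final else delta_p * lam  # looser tolerance on intermediate stages
        for _ in range(max_inner):
            if omega_l1(A, b, x, lam) <= tol:
                break
            x = soft_threshold(x - A.T @ (A @ x - b) / L, lam / L)
            total += 1
        if final:
            return x, total

# Hypothetical small instance mimicking the experimental setup above.
rng = np.random.default_rng(0)
m, n, k0 = 100, 400, 8
A = rng.standard_normal((m, n)) / np.sqrt(m)
x0 = np.zeros(n)
x0[rng.choice(n, k0, replace=False)] = rng.choice([-1.0, 1.0], k0)
z = rng.uniform(-0.005, 0.005, size=m)
b = A @ x0 + z
lam_tgt = 4 * np.max(np.abs(A.T @ z))
x_hat, iters = pg_homotopy_l1(A, b, lam_tgt)
print(iters, np.linalg.norm(x_hat - x0))
```

Each stage is warm-started from the previous solution, which is what keeps the iterates structured (here sparse) while $\lambda$ decreases.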

Problem 1. Figure 1 demonstrates the overall linear rate of convergence of the proximal-gradient homotopy
algorithm (homotopy) applied to this problem and compares it with the proximal-gradient algorithm (PG) and
its accelerated version (APG). As the rank vs. iteration plot demonstrates, the proximal-gradient algorithm
speeds up to a linear rate when the rank drops to a certain level, while the homotopy algorithm keeps the
rank at a level that ensures a linear rate of convergence.
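
For the nuclear norm used in Problem 1, the proximal step amounts to soft-thresholding the singular values, which is how the rank of the iterates is controlled. The sketch below shows this operation on a small hypothetical matrix; the function name `svt`, the sizes and the threshold are illustrative assumptions.

```python
import numpy as np

def svt(V, tau):
    """Proximal operator of tau*||.||_* : soft-threshold the singular values of V."""
    U, s, Vt = np.linalg.svd(V, full_matrices=False)
    s_thr = np.maximum(s - tau, 0.0)
    rank = int(np.count_nonzero(s_thr))      # rank of the thresholded matrix
    return (U * s_thr) @ Vt, rank

# Hypothetical rank-3 matrix plus small noise.
rng = np.random.default_rng(1)
X0 = rng.standard_normal((30, 3)) @ rng.standard_normal((3, 30))
noisy = X0 + 0.01 * rng.standard_normal((30, 30))
X_thr, r = svt(noisy, tau=1.0)
print("rank after thresholding:", r)
```

A threshold above the noise level but below the signal singular values removes the noise directions, so the thresholded matrix has the rank of the underlying signal.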

We examine the performance of the homotopy algorithm with three different values of $\eta$ and $\delta'$ in Figure 2.
For $\eta$ to satisfy the condition of Theorem 4 it is necessary that $\eta > 0.5$. However, as Figure 2 demonstrates,
one can choose $\eta < 0.5$ and still get an overall linear rate of convergence. For example, when $\eta = 0.2$, at the
beginning of the last stage where $\lambda = \lambda_{\mathrm{tgt}}$, $X^{(t)}$ is not low-rank and the algorithm has a sublinear rate of
convergence, but nevertheless the algorithm converges faster with $\eta = 0.2$ than with $\eta = 0.7$. The homotopy algorithm
appears to be even less sensitive to $\delta'$. As $\delta'$ gets closer to 1, the rank of $X^{(t)}$ jumps higher, which can cause
a slowdown in convergence, especially at the beginning of each stage.

Fig. 2: (a), (b): Performance of homotopy algorithm with $\delta' = 0.2$ and three different values of $\eta$;
(c), (d): Performance of homotopy algorithm with $\eta = 0.6$ and three different values of $\delta'$. Panels: (a) Objective gap vs. iteration, (b) Rank vs. iteration, (c) Objective gap vs. iteration, (d) Rank vs. iteration.

In Figure 3a we have compared the recovery error of the following algorithms: SVP, FPC, APGL, homotopy,
proximal-gradient and its accelerated version. In SVP we provide the algorithm with the rank of $X_0$, while in
SVP2 we use the same heuristic that is proposed in m to estimate the rank (other algorithms do not receive
the rank of $X_0$). We have implemented the FPC algorithm with the backtracking procedure, which improves
the performance of the algorithm. Both APGL and APGL2 have been implemented with continuation over
$\lambda$, with the latter utilizing an extra truncation heuristic proposed in m. The method of continuation for
APGL is the same as the one proposed in SB; we reduce $\lambda$ by a factor of 0.7 after three iterations or
whenever the stopping criterion is met, whichever comes first. In FPC and APGL, similar to the homotopy
algorithm, $\lambda_0 = \|A^T(b)\|_*$ and $\lambda_{\mathrm{tgt}} = 4\|A^T(z)\|_*$. We have used the default values of the parameters in
all the algorithms. Note that APGL2 has an extra truncation procedure which improves the recovery error.
Finally, Figure 3b shows the objective gap for the algorithms for which the quantity is meaningful.

Problem 2. Figure 4 demonstrates the linear convergence of the homotopy algorithm for this problem and
compares the performance with that of the proximal-gradient algorithm and its accelerated version. Similar to
problem 1, the homotopy algorithm keeps the number of non-zero columns below a certain level. In the homotopy
algorithm $\delta' = 0.2$ and $\eta = 0.6$.
Fig. 3: Comparison between SVP, FPC, APGL, homotopy, proximal-gradient and its accelerated version. Panels: (a) Recovery error vs. iteration, (b) Objective gap vs. iteration.

Fig. 4: Comparison of homotopy, proximal-gradient and accelerated proximal-gradient algorithms for problem 2. Panels: (a) Objective gap vs. iteration, (b) Number of non-zero columns vs. iteration, (c) Recovery error vs. iteration.
Appendix A

In this section we give a lower bound on the number of measurements $m$ that suffice for the existence of $r > 1$ in Assumption 1
with high probability when $A$ is sampled from a certain class of distributions. To simplify the notation we assume that $B = I$;
therefore, $\langle x, y\rangle = x^T y$. Given a random variable $z$, the sub-Gaussian norm of $z$ is defined as:

$$\|z\|_{\psi_2} = \inf\left\{\beta > 0 \ \middle|\ \mathbb{E}\,\psi_2\!\left(\frac{|z|}{\beta}\right) \le 1\right\},$$

where $\psi_2(x) = e^{x^2} - 1$. For an $n$-dimensional random vector $w \sim P$ the sub-Gaussian norm is defined as

$$\|w\|_{\psi_2} = \sup_{u \in S^{n-1}} \|\langle u, w\rangle\|_{\psi_2}.$$

$P$ is called isotropic if $\mathbb{E}[\langle u, w\rangle^2] = 1$ for all $u \in S^{n-1}$. Two important examples of sub-Gaussian random variables are
Gaussian and bounded random variables. Suppose $A : \mathbb{R}^n \to \mathbb{R}^m$ is given by:

$$(Ax)_i = \frac{1}{\sqrt{m}}\langle A_i, x\rangle \quad \forall i \in \{1, 2, \dots, m\}, \qquad (35)$$

where $A_i$, $1 \le i \le m$, are iid samples from an isotropic sub-Gaussian distribution $P$ on $\mathbb{R}^n$. Two important examples are
the standard Gaussian vector $A_i \sim \mathcal{N}(0, I_n)$ and the random vector of independent Rademacher variables$^1$. We want to bound the
following probabilities for $\theta \in (0,1)$:

$$P(\rho_-(A, k) < 1 - \theta) \qquad (36)$$

$$P(\rho_+(A, k) > 1 + \theta). \qquad (37)$$
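
As a rough numerical illustration of the quantities in (36) and (37), the sketch below draws $A$ as in (35) with Gaussian or Rademacher rows and records the extreme values of $\|Ax\|_2^2$ over randomly sampled $k$-sparse unit vectors. It assumes the $\ell_1$ setting, where the restricted constants range over $k$-sparse vectors, and the function name and sizes are hypothetical. Because random sampling explores only part of the feasible set, the returned pair is an optimistic estimate: the true $\rho_-(A,k)$ is no larger than the sampled minimum and the true $\rho_+(A,k)$ is no smaller than the sampled maximum.

```python
import numpy as np

def sampled_restricted_extremes(A, k, trials, rng):
    """Min and max of ||Ax||_2^2 over randomly drawn k-sparse unit vectors x."""
    n = A.shape[1]
    lo, hi = np.inf, 0.0
    for _ in range(trials):
        support = rng.choice(n, size=k, replace=False)
        x = np.zeros(n)
        x[support] = rng.standard_normal(k)
        x /= np.linalg.norm(x)               # unit Euclidean norm
        q = float(np.sum((A @ x) ** 2))
        lo, hi = min(lo, q), max(hi, q)
    return lo, hi

rng = np.random.default_rng(0)
m, n, k = 400, 200, 5
A_gauss = rng.standard_normal((m, n)) / np.sqrt(m)          # rows A_i ~ N(0, I_n), scaled by 1/sqrt(m)
A_rad = rng.choice([-1.0, 1.0], size=(m, n)) / np.sqrt(m)   # independent Rademacher entries
gauss_lo, gauss_hi = sampled_restricted_extremes(A_gauss, k, 200, rng)
rad_lo, rad_hi = sampled_restricted_extremes(A_rad, k, 200, rng)
print(gauss_lo, gauss_hi)
print(rad_lo, rad_hi)
```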

When $A_i \sim \mathcal{N}(0, I_n)$ for all $i$, one can use the generalization of Slepian's lemma by Gordon [11] alongside concentration
inequalities for Lipschitz functions of Gaussian random variables to derive (see, for example, [17, chapter 15]):

V m + 1 

P(s/p+(A,k) > 1 + 9) < e ~, 


whenever, 


9 > 


2 G(k) 

s/m 


Here, $G$ is defined as:

$$G(k) := \mathbb{E} \sup_{u \in \sqrt{k}\,\mathcal{B}_{\|\cdot\|} \cap S^{n-1}} |\langle u, g\rangle|,$$

where $g \sim \mathcal{N}(0, I_n)$. For the sub-Gaussian case, we use a result by Mendelson et al. (Theorem 2.3). Using Talagrand's generic
chaining theorem [40, Theorem 2.1.1], the authors have given a result which, similar to the Gaussian case, depends on $G(k)$.
Their result in our notation states:


Proposition 3 Suppose $A$ is given by (35). If $P$ is an isotropic distribution and $\|A_i\|_{\psi_2} \le \alpha$, then there exist constants $c_1$
and $c_2$ such that

$$\rho_-(A/\sqrt{m}, k) \ge 1 - \theta, \qquad (38)$$

$$\rho_+(A/\sqrt{m}, k) \le 1 + \theta, \qquad (39)$$

with probability exceeding $1 - \exp(-c_2\theta^2 m/\alpha^4)$ whenever

$$\theta \ge \frac{c_1\alpha^2 G(k)}{\sqrt{m}}.$$

Suppose Atgt = 4 ||A* 2 :||*, which sets 7 = We can state the following proposition based on Proposition [ 3 ] : 

~ _ 4 ~ _ 

Proposition 4 Let r > 1, k = 36rc/co(l + 7 ) 7 inc and k = cko(l + j ) 2 . If m > (G(2k) 2 + r 2 G(k) 2 ), then r satisfies 

Assumption^ with probability exceeding 1 — exp(c 2 (r — 1 ) 2 m/r 2 a 2 ). 

The proof is a simple adaptation of proof of Theorem 1.4 in m which we omit here. To compare this with the number of 
measurements sufficient for successful recovery within a given accuracy, by combining (59) in the proof of Lemma 1 and Proposition
3 we get:

Proposition 5 Let r > 1, k = cko(l + 7) 2 and x* E argmin (f>\ (x). If m > ^^j?G(k) 2 , then ||x* — ||2 — C 2 r\y/cko with 

probability exceeding 1 — exp(c 2 (r — 1 ) 2 m/r 2 a 2 ). 

Note that this bound on $m$ in the case of $\ell_1$, $\ell_{1,2}$ and nuclear norms matches, orderwise, the lower bounds given by minimax
rates in EE Ea and EE 

$^1$ For general psd $B$, the examples are $A_i = B^{-1/2}A_i'$ with $A_i' \sim \mathcal{N}(0, I_n)$ or $A_{ij}' \sim$ Rademacher for all $j$.
Appendix B

B.1 Proof of Theorem 1

Sufficiency. First consider the case where $k = 1$ and $x = \gamma_1 a_1$ with $\gamma_1 > 0$. Note that $a_1 \in \partial\|x\| = \partial\|a_1\|$ because $\|a_i\|_* = 1$
for all $a_i \in \mathcal{G}_{\|\cdot\|}$ and $\langle a_1, x\rangle = \gamma_1 = \|x\|$. Define:

$$C = \{\xi - a_1 \mid \xi \in \partial\|a_1\|\}.$$

Note that $C$ is a convex set that contains the origin. Moreover, $C$ is orthogonal to $a_1$. We claim that the decomposability condition is satisfied with
$T_{a_1}^\perp = \operatorname{span} C$. To establish the claim, we first prove that $C$ is symmetric and is contained in the dual norm ball. Let $v \in C$ and
$\xi = a_1 + v \in \partial\|a_1\|$. Since $\xi \in \partial\|a_1\|$, $\langle a_1, \xi\rangle = \|\xi\|_* = 1$. Therefore,

$$a_1 \in \operatorname*{argmax}_{a \in \mathcal{G}_{\|\cdot\|}} \langle a, \xi\rangle$$

and we can apply the hypothesis of the theorem (in particular statement I) to obtain an orthonormal representation for $\xi$:

$$\xi = a_1 + \sum_{i=1}^{l} \eta_i b_i.$$

Now by statement II in the hypothesis we get:

$$\|v\|_* = \max_i \eta_i \le \|\xi\|_* \le 1.$$

Let $\xi' = a_1 - \sum_{i=1}^{l} \eta_i b_i$. By the hypothesis, $\|\xi'\|_* = \max\{1, \max_i \eta_i\} = 1$. Also, $\langle \xi', a_1\rangle = 1$, hence $\xi' \in \partial\|a_1\|$ and $-v \in C$.

Let $v \in \operatorname{span} C$ with $\|v\|_* \le 1$. Since $C$ is a symmetric convex set, there exists $\lambda \in (0,1]$ such that $\lambda v \in C$ (i.e., $C$ is
absorbing in $\operatorname{span} C$). Define $z = a_1 + \lambda v$, which is in $\partial\|a_1\|$. Since $\langle a_1, z\rangle = \|z\|_* = 1$, we can write $z$ as

$$z = a_1 + \sum_{i=1}^{k'} \nu_i c_i,$$

where $\{c_i \mid i = 1, \dots, k'\} \subset \mathcal{G}_{\|\cdot\|}$ and $\{\nu_i > 0 \mid i = 1, \dots, k'\}$ satisfy the hypothesis of the theorem. In particular, since
$v = \frac{1}{\lambda}\sum_{i=1}^{k'} \nu_i c_i$, we have $\max_i \nu_i/\lambda \le 1$. Hence $\|a_1 + v\|_* = \max\{1, \nu_1/\lambda, \dots, \nu_{k'}/\lambda\} = 1$ and $a_1 + v \in \partial\|a_1\|$. Therefore,

$$\partial\|a_1\| = \{a_1 + v \mid v \in \operatorname{span} C,\ \|v\|_* \le 1\}.$$

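Now suppose that $x = \sum_{i=1}^{k} \gamma_i a_i$ with $k > 1$. Note that $\sum_{i=1}^{k} a_i \in \partial\|x\|$ since $\left\|\sum_{i=1}^{k} a_i\right\|_* = 1$ and $\left\langle \sum_{i=1}^{k} a_i, x\right\rangle =
\sum_{i=1}^{k} \gamma_i = \|x\|$. Let $\xi \in \partial\|x\|$ and define $v = \xi - \sum_{i=1}^{k} a_i$. We can write:

$$\|x\| = \sum_{i=1}^{k} \gamma_i = \langle \xi, x\rangle = \sum_{i=1}^{k} \gamma_i\langle \xi, a_i\rangle
\ \Rightarrow\ \forall i \in \{1, 2, \dots, k\} : \langle \xi, a_i\rangle = 1 \ \Rightarrow\ \forall i \in \{1, 2, \dots, k\} : \xi \in \partial\|a_i\|. \qquad (40)$$

Also, since $\sum_{i=1}^{k} a_i \in \partial\|x\|$, (40) results in:

$$\forall i \in \{1, 2, \dots, k\} : v \in T_{a_i}^\perp. \qquad (41)$$

Since $\xi = \sum_{i=1}^{k} a_i + v \in \partial\|a_1\|$, we have $\left\|\sum_{i=2}^{k} a_i + v\right\|_* \le 1$, hence $\sum_{i=2}^{k} a_i + v \in \partial\|a_2\|$. By induction, we conclude that
$a_k + v \in \partial\|a_k\|$. This implies $\|v\|_* \le 1$.

Let $v' \in \bigcap_{i=1}^{k} T_{a_i}^\perp$ with $\|v'\|_* \le 1$ and define $\xi' = \sum_{i=1}^{k} a_i + v'$. We will prove that $\|\xi'\|_* \le 1$ and hence $\xi' \in \partial\|x\|$.
To prove this we use induction. Define

$$z_l = \sum_{i=k-l+1}^{k} a_i + v' \quad \forall l \in \{1, 2, \dots, k\}.$$

Note that $\|z_1\|_* \le 1$ since $z_1 = a_k + v' \in \partial\|a_k\|$. Suppose $\|z_{l'}\|_* \le 1$ for some $l' < k$. We prove that $\|z_{l'+1}\|_* \le 1$. We have
$\sum_{i=k-l'+1}^{k} a_i \in T_{a_{k-l'}}^\perp$ because $\sum_{i=k-l'}^{k} a_i = a_{k-l'} + \sum_{i=k-l'+1}^{k} a_i \in \partial\|a_{k-l'}\|$. Combining this with the fact that $v' \in T_{a_{k-l'}}^\perp$,
we get $z_{l'} \in T_{a_{k-l'}}^\perp$. Therefore, $z_{l'+1} = a_{k-l'} + z_{l'} \in \partial\|a_{k-l'}\|$, hence $\|z_{l'+1}\|_* \le 1$. Thus $\|\xi'\|_* = \|z_k\|_* \le 1$. We conclude that:

$$\partial\|x\| = \left\{\sum_{i=1}^{k} a_i + v \ \middle|\ v \in \bigcap_{i=1}^{k} T_{a_i}^\perp,\ \|v\|_* \le 1\right\}. \qquad (42)$$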
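Necessity. For any $a \in \mathcal{G}_{\|\cdot\|}$, we have:

$$\langle a, a\rangle = 1,$$

$$\forall b \in \mathcal{G}_{\|\cdot\|} : \langle b, a\rangle \le \|b\|_2\|a\|_2 = 1.$$

That implies $\|a\|_* = 1$ and $a \in \partial\|a\|$. Since $a \in T_a$, we conclude that:

$$\partial\|a\| = \{a + v \mid v \in T_a^\perp,\ \|v\|_* \le 1\}. \qquad (43)$$

Take $\gamma_1 = \langle a_1, x\rangle = \|x\|_*$ and let $\Delta_1 = x - \gamma_1 a_1$. If $\Delta_1 = 0$, then take $k = 1$ and $x = \gamma_1 a_1$. Suppose $\Delta_1 \ne 0$. Since
$\left\|\frac{1}{\gamma_1}x\right\|_* = 1$ and $\left\langle a_1, \frac{1}{\gamma_1}x\right\rangle = \|a_1\| = 1$, we can conclude that $\frac{1}{\gamma_1}x \in \partial\|a_1\|$. Furthermore, we have

$$P_{T_{a_1}^\perp}(x) = x - \gamma_1 P_{T_{a_1}}\!\left(\frac{1}{\gamma_1}x\right) = x - \gamma_1 a_1 = \Delta_1. \qquad (44)$$

Now we introduce a lemma that will be used in the rest of the proof.

Lemma 4 Suppose $a \in \mathcal{G}_{\|\cdot\|}$ and $y \in T_a^\perp - \{0\}$. If $z \in \mathcal{B}_{\|\cdot\|}$ is such that $\|y\|_* = \langle y, z\rangle$, then $z \in T_a^\perp$.

Proof Without loss of generality assume that $\|y\|_* = 1$. It suffices to show that if $b \in \mathcal{G}_{\|\cdot\|}$ and $\langle y, b\rangle = 1$, then $b \in T_a^\perp$. Consider
such $b \in \mathcal{G}_{\|\cdot\|}$. By (43), $\|a + y\|_* = 1$. That results in:

$$1 \ge \langle a + y, b\rangle = \langle a, b\rangle + 1 \ \Rightarrow\ 0 \ge \langle a, b\rangle.$$

By considering $-y$ and $-b$ we get that $\langle a, b\rangle = 0$. Since $\langle a + y, b\rangle = \|b\| = 1$, we can conclude that $a + y \in \partial\|b\|$. Since
$\langle y, b\rangle = 1$ and $\|y\|_* = 1$, $y \in \partial\|b\|$. Combining these two conclusions, we get:

$$y \in \partial\|b\|,\ a + y \in \partial\|b\| \ \Rightarrow\ a \in T_b^\perp \ \Rightarrow\ \|a + b\|_* \le 1 \ \Rightarrow\ a + b \in \partial\|a\| \ \Rightarrow\ b \in T_a^\perp.$$

□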


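Suppose that there exist $l \in \{1, 2, \dots, k\}$, an orthogonal set $\{a_i \in \mathcal{G}_{\|\cdot\|} \mid i = 1, 2, \dots, l\}$, and a set of coefficients $\{\gamma_i > 0 \mid i =
1, 2, \dots, l\}$ such that $x = \sum_{i=1}^{l} \gamma_i a_i + \Delta_l$, $\Delta_l \in \bigcap_{i=1}^{l} T_{a_i}^\perp$, and:

$$\partial\left\|\sum_{i=1}^{l} a_i\right\| = \left\{\sum_{i=1}^{l} a_i + v \ \middle|\ v \in \bigcap_{i=1}^{l} T_{a_i}^\perp,\ \|v\|_* \le 1\right\}. \qquad (45)$$

By Lemma 4 there exists $a_{l+1} \in \mathcal{G}_{\|\cdot\|}$ such that $a_{l+1} \in \bigcap_{i=1}^{l} T_{a_i}^\perp$ and $\langle a_{l+1}, \Delta_l\rangle = \|\Delta_l\|_*$. Take $\gamma_{l+1} = \langle a_{l+1}, \Delta_l\rangle = \|\Delta_l\|_*$
and let $\Delta_{l+1} = \Delta_l - \gamma_{l+1}a_{l+1}$. We have $\Delta_{l+1} \in \bigcap_{i=1}^{l} T_{a_i}^\perp$ because $\{\Delta_l, a_{l+1}\} \subset \bigcap_{i=1}^{l} T_{a_i}^\perp$. Since $\left\|\frac{1}{\gamma_{l+1}}\Delta_l\right\|_* = 1$ and
$\left\langle a_{l+1}, \frac{1}{\gamma_{l+1}}\Delta_l\right\rangle = \|a_{l+1}\| = 1$, we can conclude that $\frac{1}{\gamma_{l+1}}\Delta_l \in \partial\|a_{l+1}\|$. Using the same reasoning as in (44), we have $\Delta_{l+1} \in
T_{a_{l+1}}^\perp$, hence $\Delta_{l+1} \in \bigcap_{i=1}^{l+1} T_{a_i}^\perp$.

By the decomposability assumption there exist $e \in \mathbb{R}^n$ and a subspace $T$ such that:

$$\partial\left\|\sum_{i=1}^{l+1} a_i\right\| = \{e + v \mid v \in T^\perp,\ \|v\|_* \le 1\}. \qquad (46)$$

We claim that

$$e = \sum_{i=1}^{l+1} a_i, \qquad (47)$$

$$T^\perp = \bigcap_{i=1}^{l+1} T_{a_i}^\perp. \qquad (48)$$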

To prove the first claim, it is enough to show that Xwi 1 a * C <^|1 a i||• Note that ||xlii a i|| < 1 since a i — 

Y^i =i a i + a l +1 £ & Xi=i a i II which is given by (145D . Now we can write: 


l+l 

Z+l 

J+i 

Z+l 

* 

Z+l 

l + 1 = ^ CLi 

i=1 

t'pai) < 
i= 1 

i=l 

i=l 

< 

i=l 
On the other hand, by triangle inequality,


1+1 


E 


(Li 


+ ll a i+ill — l + 1) 


thus 


!+l 

E ai 

i= 1 


Z+l l+l 

= <E a -E a ’>- 

i=l i=l 


Therefore, Y^\=i a i £ a i||- Since a i ^ T^-j+i a = T, we conclude that: 


d 


l+l 

E‘ 


i+i 


= Ui + u|u G T- 1 , ||u|r < !}• 


To prove (|48K . we first show that Hi=i ^al £ ^' _L - Let £ = e+u with v G 0^=1 • Note that ||cq_|_i +- u||* < 1 since ai+i+v G 

Furthermore, aj_|_i + u G ni=i^a^> which in turn implies J^iJi a * + v € ^||Ei=i a i|| hence || a * + < 1- 

Additionally, we have: 

Z+l Z + l 

<E ai + t, >E a ^ = 

i=l i=l 


Z + l 

E ai 

i=l 


— I + 1. 


E“i 

i= 1 


nd||az+ill 


Hence £ G ^||X]iii a z|| an d v £ ^' _L - 

Now, let a i + 1,7 G (ESI ft?. 11 ■ Note that: 

z+i z z 

a *) = a *) + <C , > a i+i> = 1 + 1 =>• (£'> X] a *) = (£'> a z+i) = i => €' g # 

2=1 2=1 2=1 

=*«'e n r +E a * +,/e n T + 

2=1 2=1 2=1 

moreover, Y2i=i a i ^ since Yli=i a i £ 5||az+i||. This implies v G Riii which completes the proof of (I48|) . 

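Because $a_i \notin \bigcap_{j=1}^{l+1} T_{a_j}^\perp$ for all $i \in \{1, 2, \dots, l+1\}$, $\dim\left(\bigcap_{j=1}^{l+1} T_{a_j}^\perp\right) \le n - l - 1$. Hence there exist $k \le n$, an orthogonal set
$\{a_i \in \mathcal{G}_{\|\cdot\|},\ i = 1, 2, \dots, k\}$, and a set of coefficients $\{\gamma_i > 0,\ i \in \{1, 2, \dots, k\}\}$ such that $x = \sum_{i=1}^{k} \gamma_i a_i$ and:

$$\partial\left\|\sum_{i=1}^{k} a_i\right\| = \left\{\sum_{i=1}^{k} a_i + v \ \middle|\ v \in \bigcap_{i=1}^{k} T_{a_i}^\perp,\ \|v\|_* \le 1\right\}. \qquad (49)$$

That proves $\|x\| = \left\langle \sum_{i=1}^{k} a_i, x\right\rangle = \sum_{i=1}^{k} \gamma_i$.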

To prove statement II, we first prove that G T^\ for all i,j G {1, 2,..., k}. By GSJ- 1 53 JL i aj || <1. We can write: 


k 

(%2 a i>aj) 

2=1 


fc fc 

1 => ^ ^ ai G 0||aj|| => ^ — aj G T a ^, 

2=1 2=1 


fc 



2=1 


Now the claim follows from Lemma |4] 

Let $l = |\{\eta_i \mid \eta_i \ne 0\}|$. If $l = 0$, the statement is trivially true. Suppose the statement is true when $l = l' - 1$ for some
$l' \in \{1, \dots, n\}$ and consider the case where $l = l'$. Suppose that $|\eta_j| = \max_i |\eta_i|$. By proper normalization we can assume that
$\eta_j = 1$. Let $y = \sum_{i \ne j} \eta_i a_i$. We can deduce the following properties for $y$:

$$a_i \in T_{a_j}^\perp\ \ (i \ne j) \ \Rightarrow\ y \in T_{a_j}^\perp,$$

$$\|y\|_* = \max_{i \ne j} |\eta_i| \le 1.$$

By the decomposability assumption $\sum_{i} \eta_i a_i = a_j + y \in \partial\|a_j\|$, hence $\left\|\sum_i \eta_i a_i\right\|_* \le 1$. Hence $\left\|\sum_i \eta_i a_i\right\|_* = 1$. □
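
The two statements of the orthogonal representation theorem can be checked numerically in the nuclear-norm case, where the singular value decomposition plays the role of the representation. The sketch below assumes that the atoms are the rank-one matrices $u_i v_i^T$ and that the inner product is the trace inner product; the matrix sizes and variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((6, 4))
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Rank-one atoms a_i = u_i v_i^T, each with unit Frobenius norm.
atoms = [np.outer(U[:, i], Vt[i]) for i in range(len(s))]

# Statement I: X = sum_i gamma_i a_i with gamma_i >= 0 and ||X||_* = sum_i gamma_i.
recon = sum(g * a for g, a in zip(s, atoms))
nuc = np.linalg.norm(X, "nuc")

# The atoms are orthonormal with respect to the trace inner product.
gram = np.array([[np.sum(a * c) for c in atoms] for a in atoms])

# Statement II: the dual (spectral) norm of sum_i eta_i a_i equals max_i |eta_i|.
eta = rng.standard_normal(len(s))
dual = np.linalg.norm(sum(e * a for e, a in zip(eta, atoms)), 2)

print(np.allclose(recon, X), np.isclose(nuc, s.sum()))
print(np.allclose(gram, np.eye(len(s))), np.isclose(dual, np.max(np.abs(eta))))
```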

Remark 1 Let $x = \sum_{i=1}^{k} \gamma_i a_i$. Since $T_x^\perp = \bigcap_{i=1}^{k} T_{a_i}^\perp$, a more general version of Lemma 4 holds:

Lemma 5 Suppose $x \in \mathbb{R}^n$ and $y \in T_x^\perp - \{0\}$. If $z \in \mathcal{B}_{\|\cdot\|}$ is such that $\|y\|_* = \langle y, z\rangle$, then $z \in T_x^\perp$.

We state and prove a dual version of Lemma 5 which will be used in the proof of Lemma 1 and Lemma 2.

Lemma 6 Let $x \in \mathbb{R}^n$. If $y \in T_x^\perp$, then there exists $z \in T_x^\perp \cap \mathcal{B}_{\|\cdot\|_*}$ such that $\|y\| = \langle y, z\rangle$.
Proof If $y = 0$, then the lemma is trivially true. If $y \ne 0$, then:


E fl {x | ||a;|| = 1} =>• 3z E T^~ such that —E argmax (a, z). 
Nl 112/11 aerJ-nB||.|| 


Therefore, by Lemma [5] we get 


max 

aGTj-nG||. || 


(a,z) < 





= lly|| 


□ 


B.2 Proof of Theorem 2

First, we introduce a lemma.

Lemma 7 Let $\{a_1, \dots, a_k\}$ be an orthogonal subset of $\mathcal{G}_{\|\cdot\|}$ that satisfies II in Theorem 1. Let $y = \sum_{i=1}^{k} \beta_i a_i$ with $\beta_i \in \mathbb{R}$ for
all $i$, then

$$K(y) = |\{i \mid \beta_i \ne 0\}|.$$

Proof Let $k' = |\{i \mid \beta_i \ne 0\}|$. Without loss of generality assume that $\beta_i \ne 0$ for $i \le k'$ and $\beta_i = 0$ for $i > k'$. Let $\eta_i = \operatorname{sgn}(\beta_i)$
and $a_i' = \operatorname{sgn}(\beta_i)a_i$ for all $i \le k'$. Since $a_1, \dots, a_k$ satisfy condition II in the orthogonal representation theorem, so do $a_1', \dots, a_{k'}'$.

Now we show that $y$ and $a_1', \dots, a_{k'}'$ satisfy condition I. By (11), $\left\|\sum_{i=1}^{k'} a_i'\right\|_* \le 1$. Therefore,

$$\|y\| \ge \left\langle \sum_{i=1}^{k'} a_i', y\right\rangle = \sum_{i=1}^{k'} |\beta_i| \ge \left\|\sum_{i=1}^{k'} \beta_i a_i\right\| = \|y\|.$$

Therefore, by the orthogonal representation theorem, $e_y = \sum_{i=1}^{k'} a_i'$. Thus $K(y) = \|e_y\|_2^2 = k'$.

□


For any $x \in \mathbb{R}^n - \{0\}$ define

$$l(x) = \min\left\{l \ \middle|\ x = \sum_{i=1}^{l} \alpha_i b_i,\ b_i \in \mathcal{G}_{\|\cdot\|},\ \alpha_i \in \mathbb{R}\right\}.$$

Define $l(0) = 0$. Now the proof is a simple consequence of the following lemma:

Lemma 8 For all $x \in \mathbb{R}^n$, $l(x) = K(x)$.

Proof $K(x) \ge l(x)$ by the definition of $l(x)$. We prove that $K(x) = l(x)$ by induction on $K(x)$. When $K(x) \in \{0, 1\}$, the
statement is trivially true. Suppose the statement is true when $K(x) \in \{0, 1, 2, \dots, k-1\}$. Consider the case where $K(x) = k$.
By way of contradiction, suppose $l(x) < K(x)$. Let

$$x = \sum_{i=1}^{k} \gamma_i a_i, \qquad (50)$$

where $\gamma_1, \dots, \gamma_k$ and $a_1, \dots, a_k$ are given by the orthogonal representation theorem. If $l(x) = 1$, then:

$$\sum_{i=1}^{k} \gamma_i a_i = \alpha_1 b_1,$$

for some $\alpha_1 \ne 0$ and $b_1 \in \mathcal{G}_{\|\cdot\|}$. Since $|\alpha_1| = \|\alpha_1 b_1\| = \|x\| = \sum_{i=1}^{k} \gamma_i$, either $b_1$ or $-b_1$ can be written as a convex combination
of $a_1, \dots, a_k$, which contradicts the fact that $b_1 \in \mathcal{G}_{\|\cdot\|}$.

If $l(x) = l > 1$, we can write $x$ as:

$$x = \sum_{i=1}^{l} \alpha_i b_i, \qquad (51)$$

with $\{b_1, \dots, b_l\} \subset \mathcal{G}_{\|\cdot\|}$. By turning to $-b_i$, without loss of generality we assume that $\alpha_i > 0$ for all $i$. Let $u = 2\alpha_1 b_1$ and
$v = 2\sum_{i=2}^{l} \alpha_i b_i$ and note that $x = (u + v)/2$. Let $C = \operatorname{Cone}\{a_1, a_2, \dots, a_k\}$. Let $\operatorname{int}C$ and $\operatorname{bd}C$ denote the interior and the
boundary of $C$, respectively. Note that $u \notin \operatorname{int}C$ because by Lemma 7 if $u \in \operatorname{int}C$, then $K(u) = k$; however, $l(u) = 1$. Now we
consider two cases for $v$.
consider two cases for v. 
Fig. 5: Relative position of $u'$ and $v'$ on the line segment between $u$ and $v$.


Case 1. If $v \in \operatorname{int}C$, then we can write $v$ as a conic combination of $a_1, a_2, \dots, a_k$ with positive coefficients:

$$v = 2\sum_{i=2}^{l} \alpha_i b_i = \sum_{i=1}^{k} c_i a_i,$$

where $c_i > 0$ for all $i$.

Case 2. If $v \notin \operatorname{int}C$, let $L = \{\theta u + (1-\theta)v \mid \theta \in [0,1]\}$. Since $L$ intersects the interior of $C$ at $x$ and $\{u, v\} \not\subset \operatorname{int}C$, there exist $u', v'$
such that $L \cap \operatorname{bd}C = \{u', v'\}$. Suppose $v'$ is on the line segment between $v$ and $x$ (see Figure 5). Let $L' = \{\theta u + (1-\theta)v' \mid \theta \in
[0,1]\}$ and note that $x \in L'$. Since $v' \in \operatorname{bd}C$, it can be written as a conic combination of at most $k-1$ of $a_1, \dots, a_k$. Without
loss of generality assume that $v' = \sum_{i=2}^{k} \beta_i a_i$. For some $\theta \in (0,1)$:

$$x = \theta u + (1-\theta)v' = \alpha_1' b_1 + \sum_{i=2}^{k} \beta_i' a_i,$$

where $\alpha_1' = 2\theta\alpha_1$ and $\beta_i' = (1-\theta)\beta_i$. Using the representation in (50), we get:

$$\alpha_1' b_1 = \gamma_1 a_1 + \sum_{i=2}^{k} (\gamma_i - \beta_i')a_i.$$

We have $l(\alpha_1' b_1) = 1$, and by Lemma 7, $K(\alpha_1' b_1) = 1 + |\{i \mid \gamma_i \ne \beta_i',\ i = 2, \dots, k\}|$. Therefore, $\gamma_i = \beta_i'$ for all $i = 2, \dots, k$ and
$b_1 = a_1$. Combining the previous fact with (50) and (51), we get:

$$x - \alpha_1 a_1 = (\gamma_1 - \alpha_1)a_1 + \sum_{i=2}^{k} \gamma_i a_i = \sum_{i=2}^{l} \alpha_i b_i. \qquad (52)$$

If $\gamma_1 = \alpha_1$, by the induction hypothesis $k = l$, which is a contradiction. Now, suppose $\gamma_1 - \alpha_1 \ne 0$. In both cases we
produced a point $y = v$ such that $K(y) = k$ and $l(y) \le l - 1$. We can continue this procedure until we get a $y$ such that
$K(y) = k$ and $l(y) = 1$, which gives us the contradiction. □
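
In the nuclear-norm case, where $K(X) = \mathrm{rank}(X)$ as in the table of Section 6 and $l(X)$ counts rank-one atoms, the identity $l(x) = K(x)$ makes $K$ subadditive, since concatenating atomic representations of two points gives a representation of their sum. The following sketch checks this numerically on hypothetical random low-rank matrices; the helper names and sizes are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

def K_rank(X, tol=1e-9):
    """Assumed nuclear-norm instance of K: the numerical rank of X."""
    return int(np.linalg.matrix_rank(X, tol=tol))

def low_rank(r, size=8):
    """Random size-by-size matrix written as a sum of r rank-one terms."""
    return rng.standard_normal((size, r)) @ rng.standard_normal((r, size))

X, Y = low_rank(2), low_rank(3)
print(K_rank(X), K_rank(Y), K_rank(X + Y))
```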


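B.3 Proof of Proposition 2

In iteration $t+1$, when the backtracking procedure stops, the following inequality holds true:

$$\phi_\lambda(x^{(t+1)}) \le m_{M_{t+1}}(x^{(t)}, x^{(t+1)}) = \min_x \left\{ f(x^{(t)}) + \langle \nabla f(x^{(t)}), x - x^{(t)}\rangle + \frac{M_{t+1}}{2}\|x - x^{(t)}\|_2^2 + \lambda\|x\| \right\}$$

$$\le \min_x \left\{ \phi_\lambda(x) + \frac{M_{t+1}}{2}\|x - x^{(t)}\|_2^2 \right\}. \qquad (53)$$

On the other hand, by (19), we have

$$\phi_\lambda(x^{(t+1)}) \le m_{L_f}(x^{(t)}, x^{(t+1)}),$$

which ensures $M_{t+1} \le \gamma_{\mathrm{inc}}L_f$ since $m_L(x^{(t)}, \cdot)$ is non-decreasing in $L$. By the restricted strong convexity hypothesis, we have:

$$\phi_\lambda(x^{(t)}) \ge \phi_\lambda(x^*) + \frac{\mu_f}{2}\|x^{(t)} - x^*\|_2^2. \qquad (54)$$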
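If we confine $x$ to $\{\alpha x^* + (1-\alpha)x^{(t)} \mid 0 \le \alpha \le 1\}$, inequality (53) combined with (54) results in

$$\phi_\lambda(x^{(t+1)}) \le \min_{\alpha \in [0,1]} \left\{ \phi_\lambda(\alpha x^* + (1-\alpha)x^{(t)}) + \frac{M_{t+1}\alpha^2}{2}\|x^{(t)} - x^*\|_2^2 \right\}$$

$$\le \min_{\alpha \in [0,1]} \left\{ \alpha\phi_\lambda(x^*) + (1-\alpha)\phi_\lambda(x^{(t)}) + \frac{M_{t+1}\alpha^2}{2}\|x^{(t)} - x^*\|_2^2 \right\}$$

$$\le \min_{\alpha \in [0,1]} \left\{ \alpha\phi_\lambda(x^*) + (1-\alpha)\phi_\lambda(x^{(t)}) + \frac{\gamma_{\mathrm{inc}}L_f\,\alpha^2}{\mu_f}\left(\phi_\lambda(x^{(t)}) - \phi_\lambda(x^*)\right) \right\}.$$

The RHS of the above inequality is minimized for $\alpha^* = \min\left\{1, \frac{\mu_f}{2\gamma_{\mathrm{inc}}L_f}\right\}$. Therefore, we get

$$\phi_\lambda(x^{(t+1)}) - \phi_\lambda(x^*) \le \left(1 - \alpha^* + \frac{\gamma_{\mathrm{inc}}L_f}{\mu_f}(\alpha^*)^2\right)\left(\phi_\lambda(x^{(t)}) - \phi_\lambda(x^*)\right) \le \left(1 - \frac{\mu_f}{4\gamma_{\mathrm{inc}}L_f}\right)\left(\phi_\lambda(x^{(t)}) - \phi_\lambda(x^*)\right).$$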
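
The minimization over $\alpha$ in the last step can be spelled out as a one-variable calculation. The worked block below assumes $\mu_f \le 2\gamma_{\mathrm{inc}}L_f$, so that the unconstrained minimizer lies in $[0,1]$; the function $h$ is notation introduced here only for this derivation.

```latex
\begin{aligned}
h(\alpha) &= 1 - \alpha + \frac{\gamma_{\mathrm{inc}} L_f}{\mu_f}\,\alpha^2, \\
h'(\alpha) &= -1 + \frac{2\gamma_{\mathrm{inc}} L_f}{\mu_f}\,\alpha = 0
  \iff \alpha^* = \frac{\mu_f}{2\gamma_{\mathrm{inc}} L_f}, \\
h(\alpha^*) &= 1 - \frac{\mu_f}{2\gamma_{\mathrm{inc}} L_f}
  + \frac{\gamma_{\mathrm{inc}} L_f}{\mu_f}\cdot\frac{\mu_f^2}{4\gamma_{\mathrm{inc}}^2 L_f^2}
  = 1 - \frac{\mu_f}{4\gamma_{\mathrm{inc}} L_f}.
\end{aligned}
```

Subtracting $\phi_\lambda(x^*)$ from both sides of the bound on $\phi_\lambda(x^{(t+1)})$ gives the factor $h(\alpha)$ multiplying the objective gap, which is why the contraction factor equals $h(\alpha^*)$.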

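To prove (23), we note that the backtracking stopping criterion ensures

$$\phi_\lambda(x^{(t+1)}) \le f(x^{(t)}) + \langle \nabla f(x^{(t)}), x^{(t+1)} - x^{(t)}\rangle + \frac{M_{t+1}}{2}\|x^{(t+1)} - x^{(t)}\|_2^2 + \lambda\|x^{(t+1)}\|$$

$$\le f(x^{(t)}) - \left\langle M_{t+1}(x^{(t+1)} - x^{(t)}) + \xi,\ x^{(t+1)} - x^{(t)}\right\rangle + \frac{M_{t+1}}{2}\|x^{(t+1)} - x^{(t)}\|_2^2 + \lambda\|x^{(t+1)}\|$$

$$\le f(x^{(t)}) - \frac{M_{t+1}}{2}\|x^{(t+1)} - x^{(t)}\|_2^2 + \langle \xi, x^{(t)} - x^{(t+1)}\rangle + \lambda\|x^{(t+1)}\|$$

$$\le \phi_\lambda(x^{(t)}) - \frac{M_{t+1}}{2}\|x^{(t+1)} - x^{(t)}\|_2^2, \qquad (55)$$

where $\xi \in \lambda\,\partial\|x^{(t+1)}\|$ comes from the first-order optimality condition of the proximal step, $\nabla f(x^{(t)}) + M_{t+1}(x^{(t+1)} - x^{(t)}) + \xi = 0$.

The hypothesis ensures $M_{t+1} \ge \mu_f$. Combining (16) and (55) and using the lower and the upper bounds on $M_{t+1}$, we
get the desired result

$$\omega_\lambda(x^{(t+1)}) \le \left\|M_{t+1}(x^{(t)} - x^{(t+1)}) + \nabla f(x^{(t+1)}) - \nabla f(x^{(t)})\right\|_*$$

$$\le \theta(M_{t+1} + L_f')\|x^{(t+1)} - x^{(t)}\|_2$$

$$\le \theta\left(1 + \frac{L_f'}{M_{t+1}}\right)\sqrt{2M_{t+1}\left(\phi_\lambda(x^{(t)}) - \phi_\lambda(x^{(t+1)})\right)}$$

$$\le \theta\left(1 + \frac{L_f'}{\mu_f}\right)\sqrt{2\gamma_{\mathrm{inc}}L_f\left(\phi_\lambda(x^{(t)}) - \phi_\lambda(x^*)\right)}.$$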

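B.4 Proof of Lemma 1

By the hypothesis there exists $\xi \in \partial\|x\|$ such that $\|A^*(Ax - b) + \lambda\xi\|_* \le \delta\lambda$. Therefore, we can write

$$\delta\lambda\|x - x_0\| \ge \|x - x_0\|\,\|A^*(Ax - b) + \lambda\xi\|_* \ge \langle x - x_0, A^*(Ax - b) + \lambda\xi\rangle$$

$$= \langle x - x_0, A^*(A(x - x_0)) - A^*z + \lambda\xi\rangle$$

$$= \|A(x - x_0)\|_2^2 - \langle x - x_0, A^*z\rangle + \lambda\langle x - x_0, \xi\rangle$$

$$\ge \|A(x - x_0)\|_2^2 - \|x - x_0\|\,\|A^*z\|_* + \lambda(\|x\| - \|x_0\|). \qquad (56)$$

Now we lower-bound $\|x\|$:

$$\|x\| = \|x - x_0 + x_0\| \ge \left\|P_{T_{x_0}^\perp}(x - x_0) + x_0\right\| - \left\|P_{T_{x_0}}(x - x_0)\right\|.$$

By Lemma 6 there exists $s \in T_{x_0}^\perp$ such that $\langle s, P_{T_{x_0}^\perp}(x - x_0)\rangle = \|P_{T_{x_0}^\perp}(x - x_0)\|$ and $\|s\|_* = 1$. Note that $e_{x_0} + s \in \partial\|x_0\|$,
hence $\|e_{x_0} + s\|_* \le 1$. Therefore, we get:

$$\left\|P_{T_{x_0}^\perp}(x - x_0) + x_0\right\| \ge \left\langle e_{x_0} + s,\ P_{T_{x_0}^\perp}(x - x_0) + x_0\right\rangle \ge \left\|P_{T_{x_0}^\perp}(x - x_0)\right\| + \|x_0\|,$$

$$\|x\| - \|x_0\| \ge \left\|P_{T_{x_0}^\perp}(x - x_0)\right\| - \left\|P_{T_{x_0}}(x - x_0)\right\|. \qquad (57)$$

Combining (56) and (57), we get

$$\delta\lambda\|x - x_0\| \ge \lambda\left(\left\|P_{T_{x_0}^\perp}(x - x_0)\right\| - \left\|P_{T_{x_0}}(x - x_0)\right\|\right) - \|x - x_0\|\,\|A^*z\|_* + \|A(x - x_0)\|_2^2.$$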
By applying the triangle inequality to $\|x - x_0\|$, we obtain

$$(\lambda(1+\delta) + \|A^*z\|_*)\left\|P_{T_{x_0}}(x - x_0)\right\| \ge (\lambda(1-\delta) - \|A^*z\|_*)\left\|P_{T_{x_0}^\perp}(x - x_0)\right\| + \|A(x - x_0)\|_2^2.$$

That yields

$$\frac{\|x - x_0\|}{\|x - x_0\|_2} \le \frac{\left\|P_{T_{x_0}}(x - x_0)\right\| + \left\|P_{T_{x_0}^\perp}(x - x_0)\right\|}{\left\|P_{T_{x_0}}(x - x_0)\right\|_2} \le (1+\gamma)\frac{\left\|P_{T_{x_0}}(x - x_0)\right\|}{\left\|P_{T_{x_0}}(x - x_0)\right\|_2} \le (1+\gamma)\sqrt{ck_0}. \qquad (58)$$

Using the definition of the lower restricted isometry constant, we derive

$$\rho_-(A, c(1+\gamma)^2k_0)\|x - x_0\|_2^2 \le \|A(x - x_0)\|_2^2 \le ((1+\delta)\lambda + \|A^*z\|_*)\left\|P_{T_{x_0}}(x - x_0)\right\|$$

$$\le \sqrt{ck_0}\,((1+\delta)\lambda + \|A^*z\|_*)\left\|P_{T_{x_0}}(x - x_0)\right\|_2$$

$$\le \sqrt{ck_0}\,((1+\delta)\lambda + \|A^*z\|_*)\|x - x_0\|_2,$$

which yields the following bounds

$$\|x - x_0\|_2 \le \frac{\sqrt{ck_0}\,((1+\delta)\lambda + \|A^*z\|_*)}{\rho_-(A, c(1+\gamma)^2k_0)}, \qquad (59)$$

$$\|x - x_0\| \le \frac{ck_0(1+\gamma)((1+\delta)\lambda + \|A^*z\|_*)}{\rho_-(A, c(1+\gamma)^2k_0)}. \qquad (60)$$

By convexity of $\phi_\lambda$,

$$\phi_\lambda(x) - \phi_\lambda(x_0) \le \langle \lambda\xi + A^*(Ax - b), x - x_0\rangle \le \frac{ck_0\,\delta\lambda(1+\gamma)((1+\delta)\lambda + \|A^*z\|_*)}{\rho_-(A, c(1+\gamma)^2k_0)}.$$

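B.5 Proof of Lemma 2

Let $\Delta = \frac{3ck_0\lambda(1+\gamma)}{2\rho_-(A, c(1+\gamma)^2k_0)}$. We can write

$$\phi_\lambda(x) \le \phi_\lambda(x_0) + \delta\lambda\Delta,$$

$$\frac{1}{2}\|Ax - b\|_2^2 - \frac{1}{2}\|Ax_0 - b\|_2^2 \le \lambda(\|x_0\| - \|x\|) + \delta\lambda\Delta \le \lambda\|x_0 - x\| + \delta\lambda\Delta. \qquad (61)$$

If $\|x - x_0\| \le \Delta$, half of the conclusion is immediate. To get the second half, we can expand the left hand side of (61) to
get:

$$\frac{1}{2}\|A(x - x_0)\|_2^2 \le \lambda\|x - x_0\| + \langle x - x_0, A^*z\rangle + \delta\lambda\Delta$$

$$\le (\lambda + \|A^*z\|_*)\|x - x_0\| + \delta\lambda\Delta$$

$$\le \left(\frac{5}{4} + \delta\right)\lambda\Delta \le \frac{3\lambda\Delta}{2}.$$

Suppose $\|x - x_0\| > \Delta$, then from (61) we get:

$$\lambda(\|x\| - \|x_0\|) \le \frac{1}{2}\|Ax_0 - b\|_2^2 - \frac{1}{2}\|Ax - b\|_2^2 + \delta\lambda\|x - x_0\|$$

$$\le -\frac{1}{2}\|A(x - x_0)\|_2^2 + \langle x - x_0, A^*z\rangle + \delta\lambda\|x - x_0\|$$

$$\le -\frac{1}{2}\|A(x - x_0)\|_2^2 + \|A^*z\|_*\|x - x_0\| + \delta\lambda\|x - x_0\|.$$

By using (57) and the triangle inequality we get:

$$(\lambda(1+\delta) + \|A^*z\|_*)\left\|P_{T_{x_0}}(x - x_0)\right\| \ge (\lambda(1-\delta) - \|A^*z\|_*)\left\|P_{T_{x_0}^\perp}(x - x_0)\right\| + \frac{1}{2}\|A(x - x_0)\|_2^2.$$

Using the same reasoning as in the proof of Lemma 1 we get the desired results.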
B.6 Proof of Lemma 3

By the first-order optimality condition there exists $\xi \in \partial\|x^+\|$ such that:

$$\lambda\xi = L(x - x^+) - \nabla f(x)$$

$$= L(x - x^+) - A^*(Ax - b)$$

$$= L(x - x^+) - A^*(A(x - x_0)) + A^*z.$$

Note that $\xi = e_{x^+} + v$ for some $v \in T_{x^+}^\perp$. By Lemma 6 there exists $v' \in T_{x^+}^\perp \cap \mathcal{B}_{\|\cdot\|_*}$ such that $\langle v', v\rangle = \|v\|$. Since
$e_{x^+} + v' \in \partial\|x^+\|$, $\|e_{x^+} + v'\|_* \le 1$. Therefore, we can write:

$$\|\xi\| = \|e_{x^+} + v\| \ge \langle e_{x^+} + v', e_{x^+} + v\rangle = \|e_{x^+}\|_2^2 + \|v\| \ \Rightarrow\ K(x^+) = \|e_{x^+}\|_2^2 \le \|\xi\|.$$

Let $\xi = \sum_{i=1}^{l} \gamma_i a_i$, where $a_1, \dots, a_l$ and $\gamma_1, \dots, \gamma_l$ are given by the orthogonal representation theorem. Since $\gamma_i \le 1$ for all
$i$, $l \ge \|\xi\|$. If $\|\xi\| \ge \tilde{k}$, we can define $u = \sum_{i=1}^{\tilde{k}} a_i$; then

fcA < (u , A£) = (n, L(x + — x)) — (Au, A(x — a;o)) + («, j4*z) 

< L||x + — a:|| + \J P+{A , fc)fc||A(a; — rco)|| 2 + k||A*z||* 

3fcA 

Since $\phi_\lambda(x^+) \le \phi_\lambda(x)$, by Lemma 2 we have:

II + II << II + III 11 11 << 9 c/coA(l + 7) 

^ — x\\ < aP — ®o + II® — ®o|| < 


- L W x+ ~ x ll + - *o)|| 2 - 


P- (-A.c(l + 7) 2 ^o) ’ 

n^-«o)ni< ^°y: + i v 

p- (A, c(l + 7 ) J fco) 


Define 


a = 7incP+(^, 2fc) 
P 2 = P+(A,k) 


9cfco(l + 7 ) 
P-(A,c(l +7) 2 fco) ’ 
9cfco(l + 7 ) 
P-0,c(l +7) 2 ko) 


We can rewrite (l62t as: 


— a — pVI < 0 => VI < — ((3 + V ft 2 + 3a) < 2y/a. 


(62) 


But this contradicts Assumption 1, so $\|\xi\| < \tilde{k}$, hence $K(x^+) < \tilde{k}$.